Final Project : Econometrics of Big Data II

Authors

Yoann PULL & Louis QUENAULT

Master 2 IREF

Published

November 17, 2023

1) AirBnB London: Predicting rental price

This case study centers on predicting rental prices for apartments listed on Airbnb in London. Our aim is to forecast prices from property features, supporting business decisions such as estimating the revenue of a property investment or finding reasonably priced accommodation across neighbourhoods. The dataset contains 51,646 observations spanning 33 London boroughs and covers a range of factors such as property type, room type, and amenities. Variables are categorized as quantitative (prefixed with “n_”), qualitative (prefixed with “f_”), and dummy amenity indicators (prefixed with “d_”) to build an efficient predictive model.

Code
# load packages

####################################################
#################### Airbnb ####################
####################################################
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import random
import time
import missingno as msno



from sklearn.model_selection import train_test_split,cross_val_score, GridSearchCV,RandomizedSearchCV, KFold
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.linear_model import LassoCV, Ridge, Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, chi2, f_classif

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers.legacy import Adam 
from tensorflow.keras.regularizers import l2, l1
from keras_tuner.tuners import RandomSearch
from keras_tuner.engine.hyperparameters import HyperParameters
import keras_tuner as kt
from tensorflow.keras.callbacks import EarlyStopping

from deap import base, creator, tools, algorithms

# This will clear our output in the NN part
from IPython.display import clear_output 
from IPython.display import display, Markdown


from scipy import stats
from scipy.stats import chi2_contingency



# Parameters of our Notebook:
Run_Code = True
seed = 42
cv_fold = 5

Code
# graph features
plt.style.use('seaborn-v0_8-deep')
theme_color = "#9E3A36" #darkred
color_theme = "#9E3A36"

# for automatic numbering of tables and graphs
graph_count = 1
table_count = 1

def graph_increment():
    global graph_count
    value = graph_count
    graph_count += 1
    return value


def table_increment():
    global table_count
    value = table_count
    table_count += 1
    return value

def table_name(name):
    title = f"*Table {table_increment()} : {name}* \n"
    return display(Markdown(title))
def graph_name(name):
    title = f"*Graph {graph_increment()} : {name}* \n"
    return display(Markdown(title))
Code
start_time_project = time.time()
Code
df = pd.read_csv('airbnb_london_homework.csv')

1.1) Data wrangling and data visualization:

In this section, we focus on the essential processes of data preparation and visualization specific to the Airbnb case. Data wrangling involves cleaning, structuring, and enhancing the dataset, while data visualization unveils key insights through graphical representations. These pivotal steps lay the groundwork for our predictive modeling and facilitate a deeper understanding of the Airbnb dataset.

Code
table_name("Data Frame Head")
df.head()

Table 1 : Data Frame Head

f_property_type f_room_type f_cancellation_policy f_bed_type f_neighbourhood_cleansed usd_price_day n_accommodates n_bathrooms n_review_scores_rating n_number_of_reviews ... d_selfcheckin d_shampoo d_smartlock d_smokedetector d_smokingallowed d_suitableforevents d_tv d_washer d_washerdryer d_wheelchairaccessible
0 Apartment Private room flexible Real Bed Kingston upon Thames 23 1 1.0 100 1 ... 0 0 0 1 0 0 0 1 0 0
1 Apartment Private room moderate Couch Kingston upon Thames 50 2 1.0 91 15 ... 0 0 0 1 0 0 1 0 0 0
2 Apartment Private room flexible Real Bed Kingston upon Thames 24 2 1.0 80 2 ... 0 0 0 1 0 0 0 1 0 0
3 House Private room flexible Real Bed Kingston upon Thames 50 2 1.5 94 0 ... 0 0 0 1 0 0 1 1 0 0
4 House Private room flexible Real Bed Kingston upon Thames 25 1 1.0 94 0 ... 0 1 0 0 0 0 0 1 0 0

5 rows × 65 columns

Code
nrow, ncol = df.shape
print("Number of rows:", nrow, "\n"
      "Number of columns:", ncol)
Number of rows: 51646 
Number of columns: 65
Code
table_name('Column information')
round(df.describe(),2)

Table 2 : Column information

usd_price_day n_accommodates n_bathrooms n_review_scores_rating n_number_of_reviews n_guests_included n_reviews_per_month n_extra_people n_minimum_nights n_beds ... d_selfcheckin d_shampoo d_smartlock d_smokedetector d_smokingallowed d_suitableforevents d_tv d_washer d_washerdryer d_wheelchairaccessible
count 51646.00 51646.00 51646.00 51646.00 51646.00 51646.00 51646.00 51646.00 51646.00 51646.00 ... 51646.00 51646.00 51646.00 51646.00 51646.00 51646.00 51646.00 51646.00 51646.00 51646.00
mean 94.88 3.06 1.26 92.44 12.35 1.42 1.14 6.67 3.31 1.71 ... 0.04 0.57 0.00 0.78 0.08 0.03 0.67 0.84 0.00 0.06
std 80.93 1.89 0.53 8.44 25.86 1.04 1.24 12.69 29.08 1.17 ... 0.20 0.49 0.04 0.42 0.27 0.16 0.47 0.37 0.03 0.25
min 8.00 1.00 0.00 20.00 0.00 1.00 0.01 0.00 1.00 0.00 ... 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
25% 43.00 2.00 1.00 92.00 0.00 1.00 0.47 0.00 1.00 1.00 ... 0.00 0.00 0.00 1.00 0.00 0.00 0.00 1.00 0.00 0.00
50% 74.00 2.00 1.00 94.00 3.00 1.00 0.77 0.00 2.00 1.00 ... 0.00 1.00 0.00 1.00 0.00 0.00 1.00 1.00 0.00 0.00
75% 120.00 4.00 1.50 97.00 12.00 1.00 1.17 10.00 3.00 2.00 ... 0.00 1.00 0.00 1.00 0.00 0.00 1.00 1.00 0.00 0.00
max 999.00 16.00 8.00 100.00 396.00 16.00 15.00 240.00 5000.00 16.00 ... 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00

8 rows × 60 columns

1.1.1) Data Wrangling:

Code
selected_columns_f = [col for col in df.columns if col.startswith('f_')]
selected_columns_n = [col for col in df.columns if col.startswith('n_')]
selected_columns_d = [col for col in df.columns if col.startswith('d_')]

# Function to create a DataFrame with data types and NA info
def create_info_df(columns, df):
    data_types = df[columns].dtypes
    has_na = df[columns].isna().any()
    info_df = pd.DataFrame({'Columns': columns, 'Data Types': data_types, 'Has NA': has_na})
    return info_df

# Create DataFrames for each group
info_df_f = create_info_df(selected_columns_f, df).reset_index(drop = True)
info_df_n = create_info_df(selected_columns_n, df).reset_index(drop = True)
info_df_d = create_info_df(selected_columns_d, df).reset_index(drop = True)

# Now you have three DataFrames (info_df_f, info_df_n, info_df_d) with the desired information.
info_df = pd.concat([info_df_f, info_df_d,info_df_n], axis = 0)
pd.set_option('display.max_rows', None)
table_name("Data type information")
info_df

Table 3 : Data type information

Columns Data Types Has NA
0 f_property_type object False
1 f_room_type object False
2 f_cancellation_policy object False
3 f_bed_type object False
4 f_neighbourhood_cleansed object False
0 d_24hourcheckin int64 False
1 d_airconditioning int64 False
2 d_breakfast int64 False
3 d_buzzerwirelessintercom int64 False
4 d_cabletv int64 False
5 d_carbonmonoxidedetector int64 False
6 d_cats int64 False
7 d_dogs int64 False
8 d_doorman int64 False
9 d_doormanentry int64 False
10 d_dryer int64 False
11 d_elevatorinbuilding int64 False
12 d_essentials int64 False
13 d_familykidfriendly int64 False
14 d_fireextinguisher int64 False
15 d_firstaidkit int64 False
16 d_freeparkingonpremises int64 False
17 d_freeparkingonstreet int64 False
18 d_gym int64 False
19 d_hairdryer int64 False
20 d_hangers int64 False
21 d_heating int64 False
22 d_hottub int64 False
23 d_indoorfireplace int64 False
24 d_internet int64 False
25 d_iron int64 False
26 d_keypad int64 False
27 d_kitchen int64 False
28 d_laptopfriendlyworkspace int64 False
29 d_lockonbedroomdoor int64 False
30 d_lockbox int64 False
31 d_otherpets int64 False
32 d_paidparkingoffpremises int64 False
33 d_petsallowed int64 False
34 d_petsliveonthisproperty int64 False
35 d_pool int64 False
36 d_privateentrance int64 False
37 d_privatelivingroom int64 False
38 d_safetycard int64 False
39 d_selfcheckin int64 False
40 d_shampoo int64 False
41 d_smartlock int64 False
42 d_smokedetector int64 False
43 d_smokingallowed int64 False
44 d_suitableforevents int64 False
45 d_tv int64 False
46 d_washer int64 False
47 d_washerdryer int64 False
48 d_wheelchairaccessible int64 False
0 n_accommodates int64 False
1 n_bathrooms float64 False
2 n_review_scores_rating int64 False
3 n_number_of_reviews int64 False
4 n_guests_included int64 False
5 n_reviews_per_month float64 False
6 n_extra_people int64 False
7 n_minimum_nights int64 False
8 n_beds int64 False
9 n_days_since int64 False
Code
pd.set_option('display.max_rows', 20)  # reset the display limit, otherwise every table below is rendered in full
print('There are', df[selected_columns_d].shape[1], 'dummy variables')
There are 49 dummy variables

We can observe that there are no missing values, and both qualitative and quantitative variables have the appropriate data types.

The 49 dummy variables currently have the int64 data type, which uses 8 bytes (64 bits) of memory per value. To conserve memory, we will cast them to bool, which needs only 1 byte per value.
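As a quick sanity check on the memory claim, a minimal sketch on a synthetic column (not the Airbnb frame) with the same number of rows:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for one dummy column with 51,646 rows
s_int = pd.Series(np.random.randint(0, 2, size=51_646), dtype="int64")
s_bool = s_int.astype("bool")

# int64 stores 8 bytes per value, bool stores 1 byte per value
print(s_int.memory_usage(index=False))   # 413168 bytes
print(s_bool.memory_usage(index=False))  # 51646 bytes
```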

Code
# Cast each dummy column from int64 to bool to reduce memory usage
for col in selected_columns_d:
    df[col] = df[col].astype('bool')
Code
table_name("Quantitative column information")

round(df[selected_columns_n].describe(),2)

Table 4 : Quantitative column information

n_accommodates n_bathrooms n_review_scores_rating n_number_of_reviews n_guests_included n_reviews_per_month n_extra_people n_minimum_nights n_beds n_days_since
count 51646.00 51646.00 51646.00 51646.00 51646.00 51646.00 51646.00 51646.00 51646.00 51646.00
mean 3.06 1.26 92.44 12.35 1.42 1.14 6.67 3.31 1.71 418.13
std 1.89 0.53 8.44 25.86 1.04 1.24 12.69 29.08 1.17 344.65
min 1.00 0.00 20.00 0.00 1.00 0.01 0.00 1.00 0.00 0.00
25% 2.00 1.00 92.00 0.00 1.00 0.47 0.00 1.00 1.00 228.00
50% 2.00 1.00 94.00 3.00 1.00 0.77 0.00 2.00 1.00 327.00
75% 4.00 1.50 97.00 12.00 1.00 1.17 10.00 3.00 2.00 504.00
max 16.00 8.00 100.00 396.00 16.00 15.00 240.00 5000.00 16.00 2722.00

Let’s check the histograms of our quantitative features:

Code
graph_name('Quantitative features histogram')

df[selected_columns_n].hist(color = color_theme, figsize = (30,30), grid = False, bins = 150);

Graph 1 : Quantitative features histogram

The same for boolean features :

Code
graph_name("Qualitative features histogram")


# map the booleans back to 0/1 so hist() can bin them
df = df.replace({False : 0, True : 1})
df[selected_columns_d].hist(color = color_theme, figsize = (30,30), grid = False, bins = 5);

Graph 2 : Qualitative features histogram

We will compile a list of features whose correlation with the price is close to zero or, in the case of non-quantitative features, that are not statistically significant under the tests below. For linear models such as linear regression we will drop these features. For non-linear models, which can capture complex relationships, omitting these columns might unintentionally discard information the model can leverage.

Code
col_to_drop = [] 

correlation = df[selected_columns_n].corrwith(df['usd_price_day'])
# Create a list of tuples containing column name and correlation value
correlation_tuples = [(col, correlation[col]) for col in correlation.index]

# Sort the list of tuples based on the absolute correlation value
sorted_correlation_tuples = sorted(correlation_tuples, key=lambda x: abs(x[1]), reverse=True)

columns_below_threshold = [col for col in correlation.index if abs(correlation[col]) < 0.01]
col_to_drop = col_to_drop + columns_below_threshold



graph_name("Correlation Heatmap")
# Create a heatmap using seaborn
plt.figure(figsize=(8, 6))  # Adjust the figure size as needed
sns.heatmap(correlation.to_frame(), annot=True, cmap="Reds", vmin=-1, vmax=1)
plt.show()

# Print the sorted list of tuples
print('Correlation Columns with usd_price_day',sorted_correlation_tuples)
print('----------------------------')
print('Columns below threshold :', columns_below_threshold)

Graph 3 : Correlation Heatmap

Correlation Columns with usd_price_day [('n_accommodates', 0.6489524427264538), ('n_beds', 0.5611478257612806), ('n_bathrooms', 0.4625743083036801), ('n_guests_included', 0.27710201039591514), ('n_extra_people', 0.07369865999319064), ('n_reviews_per_month', -0.04070877786912352), ('n_number_of_reviews', -0.03165823509742117), ('n_review_scores_rating', 0.026065398461563357), ('n_minimum_nights', 0.024322500225114778), ('n_days_since', -0.0010687631410686597)]
----------------------------
Columns below threshold : ['n_days_since']

The next block assesses whether each dummy column in the Airbnb data frame is associated with the ‘usd_price_day’ column, at the 5% significance level. For each boolean column it builds a contingency table against ‘usd_price_day’, runs a chi-square test of independence with chi2_contingency, and prints the test statistic and p-value. Columns whose p-value is at or above 0.05 are considered non-significant and appended to the col_to_drop list.
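For a continuous target such as the daily price, a two-sample t-test between listings with and without an amenity (mentioned in the feature-screening discussion above) is a natural complement. A minimal sketch on synthetic prices, with made-up group means:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)
# Synthetic daily prices for listings with / without a hypothetical amenity
price_with = rng.normal(loc=100, scale=30, size=500)
price_without = rng.normal(loc=90, scale=30, size=500)

# Welch's t-test (no equal-variance assumption)
t_stat, p_value = ttest_ind(price_with, price_without, equal_var=False)
print(round(t_stat, 2), p_value < 0.05)
```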

Code
# Create a DataFrame containing the selected columns
selected_df = df.loc[:, selected_columns_d + ['usd_price_day']]

# Caution: this re-initializes col_to_drop, discarding the correlation-based
# candidate collected above ('n_days_since')
col_to_drop = []

# Perform chi-square test for each boolean column
for col in selected_columns_d:
    contingency_table = pd.crosstab(index=selected_df[col], columns=selected_df['usd_price_day'])
    chi2, p, _, _ = chi2_contingency(contingency_table)

    # Display the chi-square test results
    print(f"Variable: {col}")
    print(f"Chi-square statistic: {round(chi2, 3)}")
    print(f"P-value: {round(p, 3)}")

    if p >= 0.05:
        col_to_drop.append(col)
        print("The variable does not have a significant effect on 'usd_price_day'.")
    else:
        print("There is a significant association between the variable and 'usd_price_day'.")
    
    print("----------------------------")

# Drop non-significant columns
selected_df = selected_df.drop(columns=col_to_drop)
Variable: d_24hourcheckin
Chi-square statistic: 1968.317
P-value: 0.0
There is a significant association between the variable and 'usd_price_day'.
----------------------------
Variable: d_airconditioning
Chi-square statistic: 2191.186
P-value: 0.0
There is a significant association between the variable and 'usd_price_day'.
----------------------------
Variable: d_breakfast
Chi-square statistic: 1148.433
P-value: 0.0
There is a significant association between the variable and 'usd_price_day'.
----------------------------
Variable: d_buzzerwirelessintercom
Chi-square statistic: 2380.291
P-value: 0.0
There is a significant association between the variable and 'usd_price_day'.
----------------------------
Variable: d_cabletv
Chi-square statistic: 2401.654
P-value: 0.0
There is a significant association between the variable and 'usd_price_day'.
----------------------------
Variable: d_carbonmonoxidedetector
Chi-square statistic: 1289.19
P-value: 0.0
There is a significant association between the variable and 'usd_price_day'.
----------------------------
Variable: d_cats
Chi-square statistic: 701.807
P-value: 0.0
There is a significant association between the variable and 'usd_price_day'.
----------------------------
Variable: d_dogs
Chi-square statistic: 498.168
P-value: 0.929
The variable does not have a significant effect on 'usd_price_day'.
----------------------------
Variable: d_doorman
Chi-square statistic: 1261.821
P-value: 0.0
There is a significant association between the variable and 'usd_price_day'.
----------------------------
Variable: d_doormanentry
Chi-square statistic: 853.906
P-value: 0.0
There is a significant association between the variable and 'usd_price_day'.
----------------------------
Variable: d_dryer
Chi-square statistic: 3466.217
P-value: 0.0
There is a significant association between the variable and 'usd_price_day'.
----------------------------
Variable: d_elevatorinbuilding
Chi-square statistic: 1766.097
P-value: 0.0
There is a significant association between the variable and 'usd_price_day'.
----------------------------
Variable: d_essentials
Chi-square statistic: 1278.438
P-value: 0.0
There is a significant association between the variable and 'usd_price_day'.
----------------------------
Variable: d_familykidfriendly
Chi-square statistic: 7569.337
P-value: 0.0
There is a significant association between the variable and 'usd_price_day'.
----------------------------
Variable: d_fireextinguisher
Chi-square statistic: 1241.708
P-value: 0.0
There is a significant association between the variable and 'usd_price_day'.
----------------------------
Variable: d_firstaidkit
Chi-square statistic: 805.27
P-value: 0.0
There is a significant association between the variable and 'usd_price_day'.
----------------------------
Variable: d_freeparkingonpremises
Chi-square statistic: 1362.188
P-value: 0.0
There is a significant association between the variable and 'usd_price_day'.
----------------------------
Variable: d_freeparkingonstreet
Chi-square statistic: 126.685
P-value: 1.0
The variable does not have a significant effect on 'usd_price_day'.
----------------------------
Variable: d_gym
Chi-square statistic: 1075.469
P-value: 0.0
There is a significant association between the variable and 'usd_price_day'.
----------------------------
Variable: d_hairdryer
Chi-square statistic: 2452.776
P-value: 0.0
There is a significant association between the variable and 'usd_price_day'.
----------------------------
Variable: d_hangers
Chi-square statistic: 1271.105
P-value: 0.0
There is a significant association between the variable and 'usd_price_day'.
----------------------------
Variable: d_heating
Chi-square statistic: 1145.797
P-value: 0.0
There is a significant association between the variable and 'usd_price_day'.
----------------------------
Variable: d_hottub
Chi-square statistic: 594.349
P-value: 0.075
The variable does not have a significant effect on 'usd_price_day'.
----------------------------
Variable: d_indoorfireplace
Chi-square statistic: 1987.791
P-value: 0.0
There is a significant association between the variable and 'usd_price_day'.
----------------------------
Variable: d_internet
Chi-square statistic: 966.623
P-value: 0.0
There is a significant association between the variable and 'usd_price_day'.
----------------------------
Variable: d_iron
Chi-square statistic: 2150.61
P-value: 0.0
There is a significant association between the variable and 'usd_price_day'.
----------------------------
Variable: d_keypad
Chi-square statistic: 1307.876
P-value: 0.0
There is a significant association between the variable and 'usd_price_day'.
----------------------------
Variable: d_kitchen
Chi-square statistic: 1357.124
P-value: 0.0
There is a significant association between the variable and 'usd_price_day'.
----------------------------
Variable: d_laptopfriendlyworkspace
Chi-square statistic: 1574.681
P-value: 0.0
There is a significant association between the variable and 'usd_price_day'.
----------------------------
Variable: d_lockonbedroomdoor
Chi-square statistic: 2573.759
P-value: 0.0
There is a significant association between the variable and 'usd_price_day'.
----------------------------
Variable: d_lockbox
Chi-square statistic: 941.551
P-value: 0.0
There is a significant association between the variable and 'usd_price_day'.
----------------------------
Variable: d_otherpets
Chi-square statistic: 321.288
P-value: 1.0
The variable does not have a significant effect on 'usd_price_day'.
----------------------------
Variable: d_paidparkingoffpremises
Chi-square statistic: 262.679
P-value: 1.0
The variable does not have a significant effect on 'usd_price_day'.
----------------------------
Variable: d_petsallowed
Chi-square statistic: 714.716
P-value: 0.0
There is a significant association between the variable and 'usd_price_day'.
----------------------------
Variable: d_petsliveonthisproperty
Chi-square statistic: 1154.76
P-value: 0.0
There is a significant association between the variable and 'usd_price_day'.
----------------------------
Variable: d_pool
Chi-square statistic: 1623.08
P-value: 0.0
There is a significant association between the variable and 'usd_price_day'.
----------------------------
Variable: d_privateentrance
Chi-square statistic: 1444.131
P-value: 0.0
There is a significant association between the variable and 'usd_price_day'.
----------------------------
Variable: d_privatelivingroom
Chi-square statistic: 331.093
P-value: 1.0
The variable does not have a significant effect on 'usd_price_day'.
----------------------------
Variable: d_safetycard
Chi-square statistic: 1414.839
P-value: 0.0
There is a significant association between the variable and 'usd_price_day'.
----------------------------
Variable: d_selfcheckin
Chi-square statistic: 1155.201
P-value: 0.0
There is a significant association between the variable and 'usd_price_day'.
----------------------------
Variable: d_shampoo
Chi-square statistic: 1818.228
P-value: 0.0
There is a significant association between the variable and 'usd_price_day'.
----------------------------
Variable: d_smartlock
Chi-square statistic: 867.779
P-value: 0.0
There is a significant association between the variable and 'usd_price_day'.
----------------------------
Variable: d_smokedetector
Chi-square statistic: 1180.273
P-value: 0.0
There is a significant association between the variable and 'usd_price_day'.
----------------------------
Variable: d_smokingallowed
Chi-square statistic: 1584.604
P-value: 0.0
There is a significant association between the variable and 'usd_price_day'.
----------------------------
Variable: d_suitableforevents
Chi-square statistic: 928.972
P-value: 0.0
There is a significant association between the variable and 'usd_price_day'.
----------------------------
Variable: d_tv
Chi-square statistic: 6609.716
P-value: 0.0
There is a significant association between the variable and 'usd_price_day'.
----------------------------
Variable: d_washer
Chi-square statistic: 1718.911
P-value: 0.0
There is a significant association between the variable and 'usd_price_day'.
----------------------------
Variable: d_washerdryer
Chi-square statistic: 2117.655
P-value: 0.0
There is a significant association between the variable and 'usd_price_day'.
----------------------------
Variable: d_wheelchairaccessible
Chi-square statistic: 1048.086
P-value: 0.0
There is a significant association between the variable and 'usd_price_day'.
----------------------------

There are 6 dummy variables that are not significantly associated with the target ‘usd_price_day’.

The next block applies the same chi-squared test to the categorical columns, evaluating their relationship with ‘usd_price_day’. It builds a contingency table for each selected column against ‘usd_price_day’, computes the chi-squared statistic and p-value, and stores the results in a list. Variables with a p-value below the 5% significance level are considered to have a significant association with ‘usd_price_day’; the others are appended to col_to_drop, indicating that they do not significantly impact the target.
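To make the mechanics concrete, a minimal chi-squared test on a hand-built 2×2 contingency table (the counts are invented for illustration):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: amenity absent / present; columns: low / high price bucket
table = np.array([[30, 10],
                  [20, 40]])

chi2, p, dof, expected = chi2_contingency(table)
print(round(chi2, 3), dof)  # 15.042 1 (with Yates' continuity correction)
print(p < 0.05)             # True: the two variables are associated
```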

Code
# Create a DataFrame containing the selected columns
selected_df = df.loc[:, selected_columns_f + ['usd_price_day']]

# Perform a chi-squared test for each column
chi2_test_results = []

for col in selected_df.columns:
    if col != 'usd_price_day':
        contingency_table = pd.crosstab(selected_df[col], selected_df['usd_price_day'])
        chi2, p, _, _ = chi2_contingency(contingency_table)
        chi2_test_results.append((col, chi2, p))

# Display the chi-squared test results
for col, chi2, p in chi2_test_results:
    print(f"Variable: {col}")
    print(f"Chi-squared statistic: {round(chi2,3)}")
    print(f"P-value: {round(p,1)}")
    if p < 0.05:  # 5% significance level
        print("The variable has a significant effect on 'usd_price_day'.")
    else:
        col_to_drop += [col]
        print("The variable does not have a significant effect on 'usd_price_day'.")
    print("----------------------------")
Variable: f_property_type
Chi-squared statistic: 5014.69
P-value: 0.0
The variable has a significant effect on 'usd_price_day'.
----------------------------
Variable: f_room_type
Chi-squared statistic: 38493.07
P-value: 0.0
The variable has a significant effect on 'usd_price_day'.
----------------------------
Variable: f_cancellation_policy
Chi-squared statistic: 5156.739
P-value: 0.0
The variable has a significant effect on 'usd_price_day'.
----------------------------
Variable: f_bed_type
Chi-squared statistic: 617.338
P-value: 0.0
The variable has a significant effect on 'usd_price_day'.
----------------------------
Variable: f_neighbourhood_cleansed
Chi-squared statistic: 27694.168
P-value: 0.0
The variable has a significant effect on 'usd_price_day'.
----------------------------
Code
df_ = df.drop(col_to_drop, axis = 1)
selected_columns_f = [col for col in df_.columns if col.startswith('f_')]
selected_columns_n = [col for col in df_.columns if col.startswith('n_')]
selected_columns_d = [col for col in df_.columns if col.startswith('d_')]

print('List of the columns to drop :', col_to_drop)
List of the columns to drop : ['d_dogs', 'd_freeparkingonstreet', 'd_hottub', 'd_otherpets', 'd_paidparkingoffpremises', 'd_privatelivingroom']

We will now perform a rudimentary outlier detection and mark the detected values as missing. We will then impute these missing values with MissForest, so that the algorithm can handle the extreme cases observed for certain apartments.

For each quantitative variable, we flag values lying outside ±4 standard deviations around the mean.
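The rule can be sketched on a toy series (the column below is hypothetical):

```python
import pandas as pd

# Fifty ordinary values and one extreme one
s = pd.Series([1.0] * 50 + [100.0])

# Flag values outside mean ± 4 standard deviations
mask = (s - s.mean()).abs() > 4 * s.std(ddof=0)
print(s[mask])  # only the extreme value at index 50 is flagged
```

Note that np.std, used in the loop below, is the population standard deviation (ddof=0); the sketch mirrors that, whereas pandas’ .std defaults to ddof=1.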

Code
# Re-select the column groups (using the full 'n_' / 'd_' / 'f_' prefixes)
select_columns_n = [col for col in df.columns if col.startswith('n_')]
select_columns_d = [col for col in df.columns if col.startswith('d_')]
select_columns_f = [col for col in df.columns if col.startswith('f_')]

outliers_df = pd.DataFrame(columns=['row_number', 'column', 'value'])

for column in select_columns_n:
    col_data = df[column]
    mean_val = np.mean(col_data)
    sd_val = np.std(col_data)
    lower_limit = mean_val - 4 * sd_val
    upper_limit = mean_val + 4 * sd_val
    outliers_mask = (col_data < lower_limit) | (col_data > upper_limit)
    
    outliers_df = pd.concat([outliers_df,
                            pd.DataFrame({'row_number': np.where(outliers_mask)[0],
                                          'column': [column] * np.sum(outliers_mask),
                                          'value': col_data[outliers_mask].tolist()})],
                            ignore_index=True)

print(outliers_df)

# Create a new DataFrame with outliers replaced by NaN
df_with_na = df.copy()
for index, row in outliers_df.iterrows():
    df_with_na.at[row['row_number'], row['column']] = np.nan

# Print the number of rows with at least one outlier value
print(f"Number of rows with at least one outlier value: {len(set(outliers_df['row_number']))}")
     row_number          column value
0            45  n_accommodates    13
1            53  n_accommodates    11
2           263  n_accommodates    11
3           630  n_accommodates    16
4           751  n_accommodates    12
...         ...             ...   ...
3959      51485    n_days_since  1850
3960      51508    n_days_since  1805
3961      51520    n_days_since  1838
3962      51543    n_days_since  1824
3963      51588    n_days_since  2021

[3964 rows x 3 columns]
Number of rows with at least one outlier value: 3283

We found 3,964 outlier values spread across 3,283 listings; these values are set to NA in a new data frame called df_with_na. Let’s check the distribution of outliers across columns.

Code
plt.figure(figsize=(12, 6))  
ax = sns.countplot(data=outliers_df, x='column', color=theme_color)

# Set labels and title
ax.set(xlabel='Column', ylabel='Count')
plt.title("Count of Outliers for Each Variable")

# Rotate x-axis labels for better visibility
plt.xticks(rotation=45, ha='right')

# Show the plot
plt.show()

Let’s now impute them with the MissForest algorithm:

Code
selected_columns_f = [col for col in df_with_na.columns if col.startswith('f_')]
selected_columns_n = [col for col in df_with_na.columns if col.startswith('n_')]
selected_columns_d = [col for col in df_with_na.columns if col.startswith('d_')]

# Function to create a DataFrame with data types and NA info
def create_info_df(columns, df):
    data_types = df[columns].dtypes
    has_na = df[columns].isna().any()
    info_df = pd.DataFrame({'Columns': columns, 'Data Types': data_types, 'Has NA': has_na})
    return info_df

# Create DataFrames for each group
info_df_f = create_info_df(selected_columns_f, df_with_na).reset_index(drop = True)
info_df_n = create_info_df(selected_columns_n, df_with_na).reset_index(drop = True)
info_df_d = create_info_df(selected_columns_d, df_with_na).reset_index(drop = True)

info_df_n[info_df_n['Has NA'] == True]
Columns Data Types Has NA
0 n_accommodates float64 True
1 n_bathrooms float64 True
2 n_review_scores_rating float64 True
3 n_number_of_reviews float64 True
4 n_guests_included float64 True
5 n_reviews_per_month float64 True
6 n_extra_people float64 True
7 n_minimum_nights float64 True
8 n_beds float64 True
9 n_days_since float64 True
Code
from missforest.missforest import MissForest
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.preprocessing import OneHotEncoder

# Assuming df_with_na is your DataFrame
select_columns_n = [col for col in df_with_na.columns if col.startswith('n')]
select_columns_d = [col for col in df_with_na.columns if col.startswith('d')]
select_columns_f = [col for col in df_with_na.columns if col.startswith('f')]


selected_columns = select_columns_n
# Extract the selected columns from the DataFrame
df_selected = df_with_na[select_columns_n]

# Create estimators for numeric and categorical imputation
clf_numeric = RandomForestRegressor(n_jobs=-1)
clf_categorical = RandomForestClassifier(n_jobs=-1)  # defined but never passed to MissForest below

# Initialize the MissForest imputer with the numeric estimator
mf = MissForest(clf_numeric)

# Fit and transform only on the selected columns
df_imputed_selected = mf.fit_transform(df_selected)

# Replace the missing values in the original DataFrame with the imputed values
df_with_na[selected_columns] = df_imputed_selected
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001507 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 777
[LightGBM] [Info] Number of data points in the train set: 48363, number of used features: 9
[LightGBM] [Info] Start training from score 2.915100
(similar LightGBM log blocks repeat for each remaining imputed column)
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000898 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 746
[LightGBM] [Info] Number of data points in the train set: 48363, number of used features: 9
[LightGBM] [Info] Start training from score 3.047722
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001329 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 781
[LightGBM] [Info] Number of data points in the train set: 48363, number of used features: 9
[LightGBM] [Info] Start training from score 1.613858
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001416 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 533
[LightGBM] [Info] Number of data points in the train set: 48363, number of used features: 9
[LightGBM] [Info] Start training from score 400.970473
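The round-robin idea behind MissForest (repeatedly re-fitting a model on each column that has missing values, using the other columns as predictors, as the LightGBM log above shows) can also be sketched with scikit-learn's `IterativeImputer`. The estimator and toy data below are illustrative, not the ones used in this project:

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Toy numeric frame with 10% missing values in one column
df_toy = pd.DataFrame(rng.normal(size=(200, 3)), columns=["n_a", "n_b", "n_c"])
df_toy.loc[rng.choice(200, 20, replace=False), "n_b"] = np.nan

# Round-robin imputation: each column with NAs is predicted from the others,
# and the process is repeated for max_iter rounds
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=20, n_jobs=-1, random_state=0),
    max_iter=5,
    random_state=0,
)
df_imputed = pd.DataFrame(imputer.fit_transform(df_toy), columns=df_toy.columns)
```

Each round of this loop corresponds to one block of per-column model fits in the log above.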
Code
selected_columns_f = [col for col in df_with_na.columns if col.startswith('f_')]
selected_columns_n = [col for col in df_with_na.columns if col.startswith('n_')]
selected_columns_d = [col for col in df_with_na.columns if col.startswith('d_')]

# Function to create a DataFrame with data types and NA info
def create_info_df(columns, df):
    data_types = df[columns].dtypes
    has_na = df[columns].isna().any()
    info_df = pd.DataFrame({'Columns': columns, 'Data Types': data_types, 'Has NA': has_na})
    return info_df

# Create DataFrames for each group
info_df_f = create_info_df(selected_columns_f, df_with_na).reset_index(drop = True)
info_df_n = create_info_df(selected_columns_n, df_with_na).reset_index(drop = True)
info_df_d = create_info_df(selected_columns_d, df_with_na).reset_index(drop = True)

info_df_n[info_df_n['Has NA'] == True]
Columns Data Types Has NA

We no longer have any missing values. Let's now compare the distributions of the quantitative variables before and after imputation.

Before Imputation :

Code
graph_name("Before Imputation")

df[selected_columns_n].hist(color = color_theme, figsize = (30,30), grid = False, bins = 100);

Graph 4 : Before Imputation

After Imputation :

Code
graph_name("After Imputation")

df_with_na[selected_columns_n].hist(color = color_theme, figsize = (30,30), grid = False, bins = 100);

Graph 5 : After Imputation

The distributions of our variables do not appear to have changed, so the imputation seems to have worked well.
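The visual check above can be complemented with a two-sample Kolmogorov-Smirnov test, which quantifies whether the before/after distributions differ. The data here are synthetic stand-ins, not the Airbnb columns:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
before = rng.normal(loc=100, scale=50, size=500)   # a column before imputation (toy data)
imputed = rng.normal(loc=100, scale=50, size=50)   # values filled in by the imputer (toy data)
after = np.concatenate([before, imputed])          # the column after imputation

stat, p = stats.ks_2samp(before, after)
# A large p-value: no evidence that imputation changed the distribution
```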

Code
df_without_na = df.copy()
df = df_with_na

1.1.1) Data Visualisation :

Code
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Average daily price per neighbourhood
dt_moy_quart = df.groupby('f_neighbourhood_cleansed')['usd_price_day'].mean().reset_index()

# Sorting by mean in descending order
dt_moy_quart = dt_moy_quart.sort_values(by='usd_price_day', ascending=False)

# Bar plot
plt.bar(dt_moy_quart['f_neighbourhood_cleansed'], dt_moy_quart['usd_price_day'],
        color=[color_theme] * (len(dt_moy_quart) - 1) + ['blue'])
graph_name("Average price per neighborhood")
plt.xlabel("Neighborhood")
plt.ylabel("Average price")
plt.xticks(rotation=90)
plt.axhline(y=np.mean(dt_moy_quart['usd_price_day']), color="black", linestyle='--', label='Average AirBnB Price')
plt.text(x=25, y=np.mean(dt_moy_quart['usd_price_day']) + 5, s="Average AirBnB Price", color="black")
plt.show()

Graph 6 : Average price per neighborhood

Code
ax1 = sns.histplot(data=df[df['f_neighbourhood_cleansed'] == "Kensington and Chelsea"], x='usd_price_day', stat='count',
                  color = color_theme,
                 binwidth = 5, label = "Kensington and Chelsea")
ax2 = sns.histplot(data=df[df['f_neighbourhood_cleansed'] == "Barking and Dagenham"], x='usd_price_day', stat='count',
                  color = "black",
                 binwidth = 10, label = "Barking and Dagenham")

graph_name("Distribution of USD Price Day in Kensington and Chelsea and Barking and Dagenham")

# Set labels and title
ax1.set(xlabel='USD Price Day', ylabel='Count')
plt.legend()
plt.show();

Graph 7 : Distribution of USD Price Day in Kensington and Chelsea and Barking and Dagenham

Code
ax = sns.histplot(data=df, x='usd_price_day', stat='count',
                  color = color_theme,
                 binwidth = 10)
graph_name("Distribution of USD Price Day")


# Set labels and title
ax.set(xlabel='USD Price Day', ylabel='Count')
# Show the plot
plt.show();

from scipy import stats
print('Skewness =', stats.skew(df['usd_price_day']))

Graph 8 : Distribution of USD Price Day

Skewness = 3.1673039705506523

The distribution of the rental price does not follow a Gaussian distribution: it shows a clear positive skewness.
A positive skewness in the rental price distribution indicates that the majority of rental properties in the area are relatively affordable, with a few high-end rentals stretching the right tail of the distribution.
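A common remedy for this kind of right skew is a log transform, which is worth keeping in mind when modelling prices. A minimal sketch on synthetic, price-like data (not the Airbnb prices themselves):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
prices = rng.lognormal(mean=4.5, sigma=0.6, size=10_000)  # right-skewed, price-like toy data

skew_raw = stats.skew(prices)          # strongly positive, as for our rental prices
skew_log = stats.skew(np.log(prices))  # close to 0: log-prices are roughly Gaussian
```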

Let's create the different data sets that we will use for the rest of the analysis:

Code
def categorical_to_integer(df, column_name):
    unique_categories = df[column_name].unique()
    category_to_int = {category: i for i, category in enumerate(unique_categories)}
    df[column_name] = df[column_name].map(category_to_int)
    return df


def dummies_col_to_drop(df):
    global selected_columns_d, selected_columns_f, selected_columns_n
    df_new = df.copy()
    selected_columns_f = [col for col in df_new.columns if col.startswith('f_')]
    df_new = df_new.drop(col_to_drop, axis = 1)
    df_new = pd.get_dummies(df_new, columns=selected_columns_f)
    selected_columns_f = [col for col in df_new.columns if col.startswith('f_')]
    selected_columns_n = [col for col in df_new.columns if col.startswith('n_')]
    selected_columns_d = [col for col in df_new.columns if col.startswith('d_')]
    return(df_new)

def dummies_not_col_to_drop(df):
    global selected_columns_d, selected_columns_f, selected_columns_n
    df_new = df.copy()
    selected_columns_f = [col for col in df_new.columns if col.startswith('f_')]
    df_new = pd.get_dummies(df_new, columns=selected_columns_f)
    selected_columns_f = [col for col in df_new.columns if col.startswith('f_')]
    selected_columns_n = [col for col in df_new.columns if col.startswith('n_')]
    selected_columns_d = [col for col in df_new.columns if col.startswith('d_')]    
    return(df_new)

def integer_col_to_drop(df):
    global selected_columns_d, selected_columns_f, selected_columns_n
    df_new = df.copy()
    selected_columns_f = [col for col in df_new.columns if col.startswith('f_')]
    df_new = df_new.drop(col_to_drop, axis = 1)
    for col in selected_columns_f:
        df_new = categorical_to_integer(df_new,col)
    selected_columns_f = [col for col in df_new.columns if col.startswith('f_')]
    selected_columns_n = [col for col in df_new.columns if col.startswith('n_')]
    selected_columns_d = [col for col in df_new.columns if col.startswith('d_')]
    return(df_new)

def integer_not_col_to_drop(df):
    global selected_columns_d, selected_columns_f, selected_columns_n
    df_new = df.copy()
    selected_columns_f = [col for col in df_new.columns if col.startswith('f_')]
    for col in selected_columns_f:
        df_new = categorical_to_integer(df_new,col)
    selected_columns_f = [col for col in df_new.columns if col.startswith('f_')]
    selected_columns_n = [col for col in df_new.columns if col.startswith('n_')]
    selected_columns_d = [col for col in df_new.columns if col.startswith('d_')]
    return(df_new)

def scale_data(df_new,col_to_scale):
    scaler = StandardScaler()

    # Find the names of the columns to scale
    columns_to_scale = [col for col in col_to_scale if col in df_new.columns]

    # Apply Z-score scaling to the specified columns
    df_new[columns_to_scale] = scaler.fit_transform(df_new[columns_to_scale])
    return(df_new)
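A quick sanity check of `scale_data`'s behaviour, with the helper repeated so the snippet runs standalone: columns listed in `col_to_scale` but absent from the frame are silently skipped, and present ones are z-scored (the toy frame is illustrative):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

def scale_data(df_new, col_to_scale):
    # same helper as above, repeated so this snippet is self-contained
    scaler = StandardScaler()
    columns_to_scale = [col for col in col_to_scale if col in df_new.columns]
    df_new[columns_to_scale] = scaler.fit_transform(df_new[columns_to_scale])
    return df_new

toy = pd.DataFrame({"n_beds": [1.0, 2.0, 3.0, 4.0], "f_room": ["a", "b", "a", "b"]})
out = scale_data(toy.copy(), ["n_beds", "n_absent"])  # "n_absent" is skipped, no error
```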

1.2) Model Prediction :

In the preceding section, we generated three potential data sets for our analysis. Now, in this part, we will employ various models and evaluate their performance. Before proceeding, we will partition our dataset into training, testing, and evaluation sets with respective proportions of 70%, 15%, and 15%.

Our approach for each algorithm will be as follows:

  1. Initially, we will identify the data set that performs the best with a base model.
  2. Next, we will fine-tune the parameters and select the optimal configuration.
  3. Finally, we will assess the performance of the best-tuned model on the evaluation set.

Ultimately, we will compare the performance of each algorithm.

a) Linear Regression :

We cannot run a linear regression on the data as it stands: the categorical features must first be encoded, either as integers or as dummy variables. Note that, since we are fitting a linear regression, we can drop the columns identified in the data-processing part. Rather than converting the categorical features to integers, we will encode them as dummy variables.
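The two encodings can be compared on a toy column (the category names below are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({"f_room_type": ["Entire home", "Private room", "Entire home"]})

# Dummy (one-hot) encoding: one binary column per category, no artificial ordering
dummies = pd.get_dummies(toy, columns=["f_room_type"])

# Integer encoding: a single column of arbitrary codes, which imposes an ordering
codes, uniques = pd.factorize(toy["f_room_type"])
```

Dummy encoding is usually preferable for a linear model, since integer codes would impose an arbitrary ordinal structure on the categories.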

Data set selection :

Qualitative variables into dummies :

We will try the data set both with the previously selected columns dropped and without dropping them.

Code
df_lr = dummies_col_to_drop(df)
print('Dimension :', df_lr.shape)

X = df_lr.drop(['usd_price_day'], axis = 1)
y = df_lr['usd_price_day']
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size = 0.3, random_state = seed) 
X_evaluate, X_test, y_evaluate, y_test = train_test_split(X_temp, y_temp, test_size = 0.5, random_state = seed)

# fit a linear model
model_lr = LinearRegression()
model_lr.fit(X_train,y_train)
y_pred = model_lr.predict(X_test)

# MSE and RMSE :
mse_lr  = mean_squared_error(y_test,y_pred)
rmse_lr = np.sqrt(mse_lr)

print('MSE Linear Regression =', mse_lr)
print('--------------------------------')
print('RMSE Linear Regression =', rmse_lr)
Dimension : (51646, 97)
MSE Linear Regression = 2477.708431650026
--------------------------------
RMSE Linear Regression = 49.776585174658436
Code
df_lr = dummies_not_col_to_drop(df)

X = df_lr.drop(['usd_price_day'], axis = 1)
y = df_lr['usd_price_day']
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size = 0.3, random_state = seed) 
X_evaluate, X_test, y_evaluate, y_test = train_test_split(X_temp, y_temp, test_size = 0.5, random_state = seed)

print('Dimension of df :', df.shape)
print('Dimension of df_lr :', df_lr.shape)

print('--------------------------------')

# fit a linear model
model_lr = LinearRegression()
model_lr.fit(X_train,y_train)
y_pred = model_lr.predict(X_test)

# MSE and RMSE :
mse_lr2  = mean_squared_error(y_test,y_pred)
rmse_lr2 = np.sqrt(mse_lr2)

print('MSE Linear Regression (without droping columns) =', mse_lr2)
print('--------------------------------')
print('RMSE Linear Regression (without droping columns) =', rmse_lr2)
Dimension of df : (51646, 65)
Dimension of df_lr : (51646, 103)
--------------------------------
MSE Linear Regression (without droping columns) = 2476.6326533095526
--------------------------------
RMSE Linear Regression (without droping columns) = 49.76577793333038

Qualitative variables into quantitative variables:

Let us now encode the qualitative features as integers instead, to see whether this performs better than the dummy encoding above.

Code
df_lr = integer_col_to_drop(df)
    
X = df_lr.drop(['usd_price_day'], axis = 1)
y = df_lr['usd_price_day']
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size = 0.3, random_state = seed) 
X_evaluate, X_test, y_evaluate, y_test = train_test_split(X_temp, y_temp, test_size = 0.5, random_state = seed)


# fit a linear model
model_lr = LinearRegression()
model_lr.fit(X_train,y_train)
y_pred = model_lr.predict(X_test)

# MSE and RMSE :
mse_lr  = mean_squared_error(y_test,y_pred)
rmse_lr = np.sqrt(mse_lr)

print('MSE Linear Regression =', mse_lr)
print('--------------------------------')
print('RMSE Linear Regression =', rmse_lr)
MSE Linear Regression = 2753.5049732928182
--------------------------------
RMSE Linear Regression = 52.47385037609512
Code
df_lr = integer_not_col_to_drop(df)
    
X = df_lr.drop(['usd_price_day'], axis = 1)
y = df_lr['usd_price_day']
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size = 0.3, random_state = seed) 
X_evaluate, X_test, y_evaluate, y_test = train_test_split(X_temp, y_temp, test_size = 0.5, random_state = seed)


# fit a linear model
model_lr = LinearRegression()
model_lr.fit(X_train,y_train)
y_pred = model_lr.predict(X_test)

# MSE and RMSE :
mse_lr  = mean_squared_error(y_test,y_pred)
rmse_lr = np.sqrt(mse_lr)

print('MSE Linear Regression (without droping columns) =', mse_lr)
print('--------------------------------')
print('RMSE Linear Regression (without droping columns) =', rmse_lr)
MSE Linear Regression (without droping columns) = 2751.8581296151046
--------------------------------
RMSE Linear Regression (without droping columns) = 52.45815598755931

We can see that encoding the qualitative variables as dummies performs better.

Scale Data:

It appears that transforming our categorical values into dummy variables yields better results. Now, let’s assess how our algorithms perform when we scale the data.

Code
df_lr = dummies_col_to_drop(df)
df_lr = scale_data(df_lr,selected_columns_n)

X = df_lr.drop(['usd_price_day'], axis = 1)
y = df_lr['usd_price_day']
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size = 0.3, random_state = seed) 
X_evaluate, X_test, y_evaluate, y_test = train_test_split(X_temp, y_temp, test_size = 0.5, random_state = seed)

# fit a linear model
model_lr = LinearRegression()
model_lr.fit(X_train,y_train)
y_pred = model_lr.predict(X_test)

# MSE and RMSE :
mse_lr  = mean_squared_error(y_test,y_pred)
rmse_lr = np.sqrt(mse_lr)

print('MSE Linear Regression =', mse_lr)
print('--------------------------------')
print('RMSE Linear Regression =', rmse_lr)
MSE Linear Regression = 2480.2777280973683
--------------------------------
RMSE Linear Regression = 49.802386771091285
Code
df_lr = dummies_not_col_to_drop(df)
df_lr = scale_data(df_lr,selected_columns_n)
print('Dimension :', df_lr.shape)

X = df_lr.drop(['usd_price_day'], axis = 1)
y = df_lr['usd_price_day']
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size = 0.3, random_state = seed) 
X_evaluate, X_test, y_evaluate, y_test = train_test_split(X_temp, y_temp, test_size = 0.5, random_state = seed)

# fit a linear model
model_lr = LinearRegression()
model_lr.fit(X_train,y_train)
y_pred = model_lr.predict(X_test)

# MSE and RMSE :
mse_lr  = mean_squared_error(y_test,y_pred)
rmse_lr = np.sqrt(mse_lr)

print('MSE Linear Regression =', mse_lr)
print('--------------------------------')
print('RMSE Linear Regression =', rmse_lr)
Dimension : (51646, 103)
MSE Linear Regression = 2477.261251961445
--------------------------------
RMSE Linear Regression = 49.77209310408238

Scaling leaves the performance essentially unchanged (the differences in MSE appear only around the fourth significant digit). We will nonetheless proceed to tune our parameters on the scaled data with the undropped columns, since scaling matters for the regularized models that follow.

Next, let’s continue by performing feature selection using a genetic algorithm.

Features selection, Genetic Algorithm :

Code
if True :
    # DEAP supplies the GA building blocks (creator, toolbox, operators) used below
    from deap import base, creator, tools, algorithms

    df_lr = dummies_not_col_to_drop(df)
    df_lr = scale_data(df_lr,selected_columns_n)
    X = df_lr.drop(['usd_price_day'], axis = 1)
    y = df_lr['usd_price_day']
    X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size = 0.3, random_state = seed) 
    X_evaluate, X_test, y_evaluate, y_test = train_test_split(X_temp, y_temp, test_size = 0.5, random_state = seed)

    
    # Create a function to evaluate the fitness of an individual (subset of features)
    def evaluate(individual):
        selected_features = [feature for feature, is_selected in zip(X_train.columns, individual) if is_selected]
        X_train_selected = X_train[selected_features]
        X_test_selected = X_test[selected_features]
        model = LinearRegression()
        model.fit(X_train_selected, y_train)
        y_pred = model.predict(X_test_selected)
        mse = mean_squared_error(y_test, y_pred)
        return mse,  # Ensure it returns a tuple with a single value

    v_mse = []

    # Create a DEAP toolbox for the GA
    creator.create("FitnessMin", base.Fitness, weights=(-1.0,))
    creator.create("Individual", list, fitness=creator.FitnessMin)
    toolbox = base.Toolbox()

    # Create a function to generate a random individual (subset of features)
    toolbox.register("attr_bool", np.random.choice, [0, 1], size=len(X_train.columns))
    toolbox.register("individual", tools.initIterate, creator.Individual, toolbox.attr_bool)
    toolbox.register("population", tools.initRepeat, list, toolbox.individual)

    # Register genetic operators
    toolbox.register("mate", tools.cxTwoPoint)
    toolbox.register("mutate", tools.mutFlipBit, indpb=0.05)
    toolbox.register("select", tools.selTournament, tournsize=5)

    # Set the GA parameters
    population_size = 150
    num_generations = 15
    crossover_prob  = 0.7
    mutation_prob   = 0.01

    # Initialize the population before the main GA loop
    population = toolbox.population(n=population_size)

    # Main GA loop
    for generation in range(num_generations):
        print(f"Generation {generation + 1}/{num_generations}")

        # Evaluate the fitness of each individual in the population
        for ind in population:
            ind.fitness.values = evaluate(ind)

        # Calculate the best (lowest) MSE across the whole population and store it
        best_mse = min(ind.fitness.values[0] for ind in population)
        v_mse.append(best_mse)

        # Select parents for the next generation first, then apply crossover and
        # mutation; selecting before varAnd ensures the tournament only compares
        # individuals whose fitness has actually been evaluated
        parents = toolbox.select(population, k=len(population))

        # Replace the current population with the varied offspring
        population = algorithms.varAnd(parents, toolbox, cxpb=crossover_prob, mutpb=mutation_prob)

    # Get the best individual from the final population
    best_individual = tools.selBest(population, k=1)[0]
    selected_features = [feature for feature, is_selected in zip(X_train.columns, best_individual) if is_selected]

    # Perform linear regression with the selected features
    X_train_selected = X_train[selected_features]
    X_test_selected = X_test[selected_features]
    model_selected_features = LinearRegression()
    model_selected_features.fit(X_train_selected, y_train)
    y_pred_selected_features = model_selected_features.predict(X_test_selected)

    # Calculate MSE and RMSE
    mse_lr_ga = mean_squared_error(y_test, y_pred_selected_features)
    rmse_lr_ga = np.sqrt(mse_lr_ga)
    print('\n--------------------------------')
    print('Selected Features :', selected_features)
    print('--------------------------------')
    print('MSE Linear Regression =', mse_lr_ga)
    print('--------------------------------')
    print('RMSE Linear Regression =', rmse_lr_ga)

    # Create a color palette with a single color '#FF6B6B' repeated for each generation
    salmon_palette = ['#FF6B6B'] * num_generations

    # Use Seaborn to set the color palette
    sns.set_palette(salmon_palette)

    # Create the plot with a black border around the line
    plt.plot(range(1, num_generations + 1), v_mse, marker='o',
             color=theme_color, markerfacecolor='w', markersize=5, markeredgecolor='k', lw=2)
    plt.xlabel('Generation')
    plt.ylabel('MSE')
    graph_name('MSE Evolution')
    plt.grid(False)  # Turn off the grid lines
    plt.show()
Generation 1/15
Generation 2/15
Generation 3/15
Generation 4/15
Generation 5/15
Generation 6/15
Generation 7/15
Generation 8/15
Generation 9/15
Generation 10/15
Generation 11/15
Generation 12/15
Generation 13/15
Generation 14/15
Generation 15/15

--------------------------------
Selected Features : ['n_accommodates', 'n_bathrooms', 'n_review_scores_rating', 'n_number_of_reviews', 'n_guests_included', 'd_airconditioning', 'd_breakfast', 'd_cabletv', 'd_carbonmonoxidedetector', 'd_cats', 'd_doorman', 'd_dryer', 'd_elevatorinbuilding', 'd_essentials', 'd_fireextinguisher', 'd_freeparkingonpremises', 'd_freeparkingonstreet', 'd_gym', 'd_hangers', 'd_heating', 'd_hottub', 'd_indoorfireplace', 'd_laptopfriendlyworkspace', 'd_lockonbedroomdoor', 'd_lockbox', 'd_paidparkingoffpremises', 'd_petsliveonthisproperty', 'd_smartlock', 'd_smokedetector', 'd_suitableforevents', 'd_washer', 'd_washerdryer', 'd_wheelchairaccessible', 'f_room_type_Entire home/apt', 'f_cancellation_policy_flexible', 'f_cancellation_policy_strict', 'f_bed_type_Couch', 'f_neighbourhood_cleansed_Barnet', 'f_neighbourhood_cleansed_Bexley', 'f_neighbourhood_cleansed_Bromley', 'f_neighbourhood_cleansed_Croydon', 'f_neighbourhood_cleansed_Ealing', 'f_neighbourhood_cleansed_Enfield', 'f_neighbourhood_cleansed_Greenwich', 'f_neighbourhood_cleansed_Hackney', 'f_neighbourhood_cleansed_Haringey', 'f_neighbourhood_cleansed_Hillingdon', 'f_neighbourhood_cleansed_Islington', 'f_neighbourhood_cleansed_Kensington and Chelsea', 'f_neighbourhood_cleansed_Lewisham', 'f_neighbourhood_cleansed_Redbridge', 'f_neighbourhood_cleansed_Sutton', 'f_neighbourhood_cleansed_Tower Hamlets', 'f_neighbourhood_cleansed_Waltham Forest', 'f_neighbourhood_cleansed_Westminster']
--------------------------------
MSE Linear Regression = 2539.006592246527
--------------------------------
RMSE Linear Regression = 50.388556163542994

Graph 5 : MSE Evolution

After fine-tuning the parameters of our Genetic Algorithm (GA), we unfortunately did not observe a significant improvement in performance. Therefore, we will now experiment with regularization methods.

Regularization :

Code
df_lr = dummies_not_col_to_drop(df)
df_lr = scale_data(df_lr,selected_columns_n)
X = df_lr.drop(['usd_price_day'], axis = 1)
y = df_lr['usd_price_day']
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size = 0.3, random_state = seed) 
X_evaluate, X_test, y_evaluate, y_test = train_test_split(X_temp, y_temp, test_size = 0.5, random_state = seed)


#To perform CV
X_ = pd.concat([X_train,X_test], axis = 0, ignore_index = True)
y_ = pd.concat([y_train,y_test],  axis = 0, ignore_index = True)

# Create and fit a Lasso model with cross-validated alpha selection and a cross validation of 5
lasso = LassoCV(alphas=[0.005,0.01, 0.1, 1.0, 10.0], cv=cv_fold)
lasso.fit(X_, y_)

# Get the selected alpha and coefficients
best_alphaL1 = lasso.alpha_
lasso_coefs = lasso.coef_

print('Optimal alpha for Lasso :', best_alphaL1)
print('--------------------------------')

# Perform feature selection
selected_features = [feature for feature, coef in zip(X_train.columns, lasso_coefs) if coef != 0]

# Now, you can use the selected_features for further analysis or modeling
X_train_selected = X_train[selected_features]
X_test_selected  = X_test[selected_features]
model_selected_features = LinearRegression()
model_selected_features.fit(X_train_selected, y_train)

y_pred_selected_features = model_selected_features.predict(X_test_selected)

# MSE and RMSE :
mse_lr  = mean_squared_error(y_test,y_pred_selected_features)
rmse_lr_lasso = np.sqrt(mse_lr)

print('MSE Linear Regression =', mse_lr)
print('--------------------------------')
print('RMSE Linear Regression =', rmse_lr_lasso)
Optimal alpha for Lasso : 0.005
--------------------------------
MSE Linear Regression = 2476.5986803359365
--------------------------------
RMSE Linear Regression = 49.76543660348954
Code
df_lr = dummies_not_col_to_drop(df)
df_lr = scale_data(df_lr,selected_columns_n)
X = df_lr.drop(['usd_price_day'], axis = 1)
y = df_lr['usd_price_day']
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size = 0.3, random_state = seed) 
X_evaluate, X_test, y_evaluate, y_test = train_test_split(X_temp, y_temp, test_size = 0.5, random_state = seed)

#To perform CV
X_ = pd.concat([X_train,X_test], axis = 0, ignore_index = True)
y_ = pd.concat([y_train,y_test],  axis = 0, ignore_index = True)

# Define a range of alpha values to search
alphas = [0.005,0.01, 0.1, 1.0, 10.0]  # You can adjust this range as needed

# Create a Ridge model
ridge_regressor = Ridge()

# Perform a grid search to find the best alpha with a cross validation of 5
param_grid = {'alpha': alphas}
grid_search = GridSearchCV(ridge_regressor, param_grid, cv=cv_fold, scoring='neg_mean_squared_error')
grid_search.fit(X_, y_)

# Get the best alpha from the grid search
best_alphaL2 = grid_search.best_params_['alpha']

# Create a Ridge model with the best alpha
best_ridge_regressor = Ridge(alpha=best_alphaL2)
best_ridge_regressor.fit(X_train, y_train)

# Make predictions on the test data
y_pred_ridge = best_ridge_regressor.predict(X_test)

# Evaluate the Ridge model with the best alpha
mse_lr = mean_squared_error(y_test, y_pred_ridge)

print(f"Best Alpha for Ridge: {best_alphaL2}")
print('--------------------------------')
print(f"Mean Squared Error (Ridge) = {mse_lr}")
print('--------------------------------')
print('RMSE Linear Regression (Ridge) =', np.sqrt(mse_lr))
Best Alpha for Ridge: 1.0
--------------------------------
Mean Squared Error (Ridge) = 2476.7040545824048
--------------------------------
RMSE Linear Regression (Ridge) = 49.766495301381276

After tuning the parameters of our Genetic Algorithm (GA) and exploring Ridge and Lasso regression, we found that the best performance was achieved with Lasso regression and a regularization coefficient of 0.005. However, the gain over the plain sklearn linear regression is marginal (MSE 2476.60 versus 2476.63), and the GA-based feature selection (MSE 2539.01) remains clearly behind both: despite our efforts, we were unable to meaningfully improve on the sklearn baseline.

Considering the large number of features in our dataset (over 100 once the dummy variables are included), we have decided not to pursue an exhaustive search over feature subsets or a best subset model, as these methods are impractical at this dimensionality.
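The infeasibility of best subset selection is easy to quantify: with p candidate features there are 2**p possible subsets to evaluate. A quick back-of-the-envelope check:

```python
# The number of feature subsets grows as 2**p
for p in (10, 90, 102):
    print(p, 2**p)

# Even at a billion model fits per second, 2**102 subsets would take
# vastly longer than the age of the universe
seconds = 2**102 / 1e9
years = seconds / (3600 * 24 * 365)
print(f"{years:.2e} years")
```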

We can conclude that the most effective linear model converts the categorical features into dummy variables, scales the data, and selects features with Lasso regression.

Now, let’s assess the performance of our selected algorithm on the evaluation set:

Evaluation :

Code
df_lr = dummies_not_col_to_drop(df)
df_lr = scale_data(df_lr,selected_columns_n)
X = df_lr.drop(['usd_price_day'], axis = 1)
y = df_lr['usd_price_day']
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size = 0.3, random_state = seed) 
X_evaluate, X_test, y_evaluate, y_test = train_test_split(X_temp, y_temp, test_size = 0.5, random_state = seed)


lasso = Lasso(alpha=best_alphaL1)
lasso.fit(X_train, y_train)

lasso_coefs = lasso.coef_

# Perform feature selection
selected_features = [feature for feature, coef in zip(X_train.columns, lasso_coefs) if coef != 0]

# Now, you can use the selected_features for further analysis or modeling
X_train_selected = X_train[selected_features]
X_evaluate_selected  = X_evaluate[selected_features]
model_selected_features = LinearRegression()
model_selected_features.fit(X_train_selected, y_train)

y_pred_selected_features = model_selected_features.predict(X_evaluate_selected)

# MSE and RMSE :
mse_lr  = mean_squared_error(y_evaluate,y_pred_selected_features)
rmse_lr = np.sqrt(mse_lr)

print('MSE Linear Regression =', mse_lr)
print('--------------------------------')
print('RMSE Linear Regression =', rmse_lr)
MSE Linear Regression = 2703.6068590764185
--------------------------------
RMSE Linear Regression = 51.99621966139864

b) RandomForestRegressor:

Since Random Forest can exploit non-linear interactions between variables, we will not be dropping any columns in this case.

Data set selection :

Qualitative variables into dummies :

Code
df_rf = dummies_not_col_to_drop(df)

X = df_rf.drop(['usd_price_day'], axis = 1)
y = df_rf['usd_price_day']
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size = 0.3, random_state = seed) 
X_evaluate, X_test, y_evaluate, y_test = train_test_split(X_temp, y_temp, test_size = 0.5, random_state = seed)


# Create a Random Forest Regressor
rf_regressor = RandomForestRegressor(n_estimators=100, random_state=seed)  # You can adjust the number of trees (n_estimators) as needed

# Fit the model on the training data
rf_regressor.fit(X_train, y_train)

# Make predictions on the test data
y_pred = rf_regressor.predict(X_test)

# Evaluate the model (for regression tasks, you can use mean squared error or other appropriate metrics)
mse_rf  = mean_squared_error(y_test,y_pred)
rmse_rf = np.sqrt(mse_rf)

print('MSE Random Forest =', mse_rf)
print('--------------------------------')
print('RMSE Random Forest =', rmse_rf)
MSE Random Forest = 2108.898726683195
--------------------------------
RMSE Random Forest = 45.922747377342255

Qualitative variables into quantitative variables:

Code
df_rf = integer_not_col_to_drop(df)

X = df_rf.drop(['usd_price_day'], axis = 1)
y = df_rf['usd_price_day']

X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size = 0.3, random_state = seed) 
X_evaluate, X_test, y_evaluate, y_test = train_test_split(X_temp, y_temp, test_size = 0.5, random_state = seed)

# Create a Random Forest Regressor
rf_regressor = RandomForestRegressor(n_estimators=100, random_state=seed)  # You can adjust the number of trees (n_estimators) as needed

# Fit the model on the training data
rf_regressor.fit(X_train, y_train)

# Make predictions on the test data
y_pred = rf_regressor.predict(X_test)

# Evaluate the model (for regression tasks, you can use mean squared error or other appropriate metrics)
mse_rf  = mean_squared_error(y_test,y_pred)
rmse_rf = np.sqrt(mse_rf)

print('MSE Random Forest =', mse_rf)
print('--------------------------------')
print('RMSE Random Forest =', rmse_rf)
MSE Random Forest = 2114.79631665708
--------------------------------
RMSE Random Forest = 45.98691462423936

Encoding the qualitative variables as dummies performs slightly better on the test set (MSE ≈ 2108.9) than encoding them as integers (MSE ≈ 2114.8). We will therefore tune the parameters on the dummy-encoded data set.

Scaling Data: (Categorical to Dummies)

It is important to highlight that scaling the data should have no significant impact on algorithms like Random Forest: trees split on thresholds, so rescaling a feature only rescales the thresholds without changing how the data are partitioned.
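This scale-insensitivity is easy to verify: standardizing a feature is a monotone transformation, so a tree fitted on scaled data learns the same partition with rescaled thresholds. A minimal sketch with a single decision tree on synthetic data (not the Airbnb data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] * 2 + rng.normal(scale=0.1, size=200)

# Fit identical trees on raw and on standardized features
tree_raw = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)
tree_scaled = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_scaled, y)

# The learned partitions coincide, so the predictions agree
same = np.allclose(tree_raw.predict(X), tree_scaled.predict(X_scaled))
print(same)  # True
```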

Code
df_rf = dummies_not_col_to_drop(df)
df_rf = scale_data(df_rf,selected_columns_n)

X = df_rf.drop(['usd_price_day'], axis = 1)
y = df_rf['usd_price_day']
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size = 0.3, random_state = seed) 
X_evaluate, X_test, y_evaluate, y_test = train_test_split(X_temp, y_temp, test_size = 0.5, random_state = seed)

# Create a Random Forest Regressor
rf_regressor = RandomForestRegressor(n_estimators=100, random_state=seed)  # You can adjust the number of trees (n_estimators) as needed

# Fit the model on the training data
rf_regressor.fit(X_train, y_train)

# Make predictions on the test data
y_pred = rf_regressor.predict(X_test)

# Evaluate the model (for regression tasks, you can use mean squared error or other appropriate metrics)
mse_rf  = mean_squared_error(y_test,y_pred)
rmse_rf = np.sqrt(mse_rf)

print('MSE Random Forest =', mse_rf)
print('--------------------------------')
print('RMSE Random Forest =', rmse_rf)
MSE Random Forest = 2111.573403821936
--------------------------------
RMSE Random Forest = 45.95185963399018

As expected, the performance is no better than without scaling.

The dataset where our baseline Random Forest model performs best is the one where qualitative variables are converted into dummy variables, and data scaling is not applied. We will use this dataset as the basis for tuning the parameters of the Random Forest model.

Tuning parameters :

Code
df_rf = dummies_not_col_to_drop(df)

X = df_rf.drop(['usd_price_day'], axis = 1)
y = df_rf['usd_price_day']
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size = 0.3, random_state = seed) 
X_evaluate, X_test, y_evaluate, y_test = train_test_split(X_temp, y_temp, test_size = 0.5, random_state = seed)

# To deal with CV
X_ = pd.concat([X_train,X_test], axis = 0, ignore_index = True)
y_ = pd.concat([y_train,y_test],  axis = 0, ignore_index = True)

#Run_Code = True

GridSearch :

Code
if Run_Code:
    # Define the parameter grid to search
    param_grid = {    
        'n_estimators': [150, 200,250, 300, 350],
        'max_depth': [None, 1, 2, 3],  
        'min_samples_split': [2, 5],  
        'min_samples_leaf': [1, 2, 3, 4],  
        'max_features': ['sqrt', 'log2'],  
        'random_state': [seed]  
    }

    # Create a Random Forest Regressor
    rf_regressor = RandomForestRegressor()

    # Gridsearch with a cross validation of 5
    grid_search = GridSearchCV(estimator=rf_regressor,
                               param_grid=param_grid,
                               cv=cv_fold,
                               n_jobs = 8,
                               verbose = 1,
                               scoring='neg_mean_squared_error')

    start_time = time.time()
    
    # Perform the grid search
    grid_search.fit(X_, y_)

    # Get the best parameters
    best_params = grid_search.best_params_

    # Create a Random Forest Regressor with the best parameters
    best_rf_regressor = RandomForestRegressor(**best_params)

    # Fit the model on the training data with the best parameters
    best_rf_regressor.fit(X_train, y_train)

    # Make predictions on the test data
    y_pred = best_rf_regressor.predict(X_test)


    grid_search_time = time.time() - start_time

    # Evaluate the model with the best parameters
    mse_rf = mean_squared_error(y_test, y_pred)
    rmse_rf = np.sqrt(mse_rf)

    print('Time taken for grid search: {:.2f} seconds'.format(grid_search_time))
    print('--------------------------------')
    print('Best Parameters for Random Forest:', best_params)
    print('--------------------------------')
    print('MSE Random Forest with Best Parameters:', mse_rf)
    print('--------------------------------')
    print('RMSE Random Forest with Best Parameters:', rmse_rf)
Fitting 5 folds for each of 320 candidates, totalling 1600 fits
Time taken for grid search: 2637.31 seconds
--------------------------------
Best Parameters for Random Forest: {'max_depth': None, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 350, 'random_state': 42}
--------------------------------
MSE Random Forest with Best Parameters: 1959.6751837615416
--------------------------------
RMSE Random Forest with Best Parameters: 44.268218664879

RandomSearch :

Code
if Run_Code:

    # NB: despite its arguments, this helper tunes on the module-level X_ and y_
    # (train + test concatenated above), mirroring the grid search section
    def tune_random_forest_hyperparameters(X_train, y_train, cv_fold=5):
        # Define the hyperparameter grid
        grid = {
            'n_estimators': [50, 100, 125 , 150, 175,  200],
            'max_features': ['sqrt', 'log2'],
            'max_depth': [1, 2, 3, 5,10,20],
            'min_samples_split': [2, 5],
            'min_samples_leaf': [1, 2, 3, 4]
        }

        # Create the Random Forest model
        model_rf = RandomForestRegressor()

        # Perform randomized search for hyperparameter tuning
        tic = time.time()
        rando_search = RandomizedSearchCV(
            estimator=model_rf,
            param_distributions=grid,
            cv=cv_fold,
            n_iter=500,
            n_jobs=8,
            scoring='neg_mean_squared_error'
        )

        rando_result = rando_search.fit(X_, y_)

        # Print the results
        printstring = f"Randomized search performed in {time.time() - tic:.0f}s\n"
        printstring += f"Best Mean Squared Error: {-rando_result.best_score_:.4f}\n"

        for k, v in rando_result.best_params_.items():
            printstring += f"{k}: {v}\n"

        print(printstring)

        # Optionally, visualize the results
        visualize_search_results(rando_search)

    # Define a color mapping function based on iteration
    def get_color(iteration):
        if iteration < 100:
            return theme_color  # Color for iteration < 100
        elif iteration < 200:
            return 'lightgray'  # Color for iteration < 200
        elif iteration < 300:
            return 'salmon'  # Color for iteration < 300
        elif iteration < 400:
            return 'red'  
        else :
            return 'orange'

    def visualize_search_results(search_results):
        means = search_results.cv_results_['mean_test_score']
        params = search_results.cv_results_['params']

        # Plot the distribution of mean squared errors for different hyperparameters
        plt.figure(figsize=(10, 6))

        for i in range(len(means)):
            color = get_color(i)  # Determine the color based on iteration
            plt.scatter(i, means[i], c=color, s=10)

        graph_name('Randomized Search Results')
        plt.xlabel('Iteration')
        plt.ylabel('Mean Squared Error')
        plt.grid(True)
        plt.show()
    # Usage:
    tune_random_forest_hyperparameters(X_train, y_train, cv_fold=cv_fold)
Randomized search performed in 10167s
Best Mean Squared Error: 2212.2828
n_estimators: 200
min_samples_split: 2
min_samples_leaf: 1
max_features: sqrt
max_depth: 20

Graph 6 : Randomized Search Results

Code
print(f"The grid search performed best, yielding a Random Forest model with the following parameters: \n {best_params}")
The grid search performed best, yielding a Random Forest model with the following parameters: 
 {'max_depth': None, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 350, 'random_state': 42}

We will then evaluate the Random Forest Regressor, using the parameters obtained with the grid search algorithm, on the evaluation set.

Evaluation :

Code
df_rf = dummies_not_col_to_drop(df)

X = df_rf.drop(['usd_price_day'], axis = 1)
y = df_rf['usd_price_day']
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size = 0.3, random_state = seed) 
X_evaluate, X_test, y_evaluate, y_test = train_test_split(X_temp, y_temp, test_size = 0.5, random_state = seed)


rf_model = RandomForestRegressor(**best_params)  

# Fit the model to your data
rf_model.fit(X_train, y_train)

y_pred = rf_model.predict(X_evaluate)

mse_rf  = mean_squared_error(y_evaluate,y_pred)
rmse_rf = np.sqrt(mse_rf)

print('MSE Random Forest =', mse_rf)
print('--------------------------------')
print('RMSE Random Forest =', rmse_rf)
MSE Random Forest = 2159.344900731611
--------------------------------
RMSE Random Forest = 46.46875187404554

c) Neural Network :

We will implement a neural network with the ReLU activation function in the hidden layers and a linear activation in the output layer. As before, we first compare the data set where the columns identified during the data preprocessing stage are dropped against the one that keeps them.

Data set selection :

Qualitative variables into dummies :

We compare two data sets: one where the columns selected earlier are dropped, and one where they are kept.

Code
df_nn = dummies_col_to_drop(df)
X = df_nn.drop(['usd_price_day'], axis = 1)
y = df_nn['usd_price_day']
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size = 0.3, random_state = seed) 
X_evaluate, X_test, y_evaluate, y_test = train_test_split(X_temp, y_temp, test_size = 0.5, random_state = seed)

n, d = X_train.shape

np.random.seed(seed)
tf.random.set_seed(seed)

# Define our neural network model
model = Sequential()
model.add(Dense(128, input_dim=d, activation='relu'))
model.add(Dense(64, activation='relu'))
model.add(Dense(1, activation='linear'))

# Compile the model
model.compile(loss='mean_squared_error', optimizer=Adam(learning_rate=0.001))

# Fit the model on the training data
model.fit(X_train, y_train, epochs=50, batch_size=32, validation_data=(X_test, y_test))

# Make predictions on the test data
y_pred = model.predict(X_test)

# Calculate the mean squared error
mse_nn = mean_squared_error(y_test, y_pred)
rmse_nn = np.sqrt(mse_nn)

from IPython.display import clear_output  # hide the verbose Keras training log
clear_output(wait=True)
print('MSE Neural Network =', mse_nn)
print('--------------------------------')
print('RMSE Neural Network =', rmse_nn)
MSE Neural Network = 2055.0571797071725
--------------------------------
RMSE Neural Network = 45.33273849776972
Code
df_lr = dummies_not_col_to_drop(df)

X = df_lr.drop(['usd_price_day'], axis = 1)
y = df_lr['usd_price_day']
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size = 0.3, random_state = seed) 
X_evaluate, X_test, y_evaluate, y_test = train_test_split(X_temp, y_temp, test_size = 0.5, random_state = seed)
n, d = X_train.shape

np.random.seed(seed)
tf.random.set_seed(seed)

# Define our neural network model
model = Sequential()
model.add(Dense(128, input_dim=d, activation='relu'))
model.add(Dense(64, activation='relu'))
model.add(Dense(1, activation='linear'))

# Compile the model
model.compile(loss='mean_squared_error', optimizer=Adam(learning_rate=0.001))

# Fit the model on the training data
model.fit(X_train, y_train, epochs=50, batch_size=32, validation_data=(X_test, y_test))

# Make predictions on the test data
y_pred = model.predict(X_test)

# Calculate the mean squared error
mse_nn = mean_squared_error(y_test, y_pred)
rmse_nn = np.sqrt(mse_nn)

clear_output(wait=True)
print('MSE Neural Network (without dropping columns) =', mse_nn)
print('--------------------------------')
print('RMSE Neural Network (without dropping columns) =', rmse_nn)
MSE Neural Network (without dropping columns) = 2047.5074967654618
--------------------------------
RMSE Neural Network (without dropping columns) = 45.24939222537096

The model trained without dropping columns performs slightly better (RMSE 45.25 vs. 45.33).

Qualitative variables into quantitative variables:

Let's now encode our qualitative variables directly as integers, to see whether this performs better than the dummy-variable approach used previously.
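The `integer_col_to_drop` helper used below is defined earlier in the project and not shown here; a minimal sketch of this kind of integer encoding (hypothetical helper name `encode_categoricals`, assuming the qualitative columns are stored with object/category dtypes) could be:

```python
import pandas as pd

def encode_categoricals(df):
    # Replace each qualitative (object/category) column with integer codes.
    # Unlike one-hot dummies, this imposes an arbitrary ordering on categories.
    df = df.copy()
    for col in df.select_dtypes(include=['object', 'category']).columns:
        df[col] = pd.factorize(df[col])[0]
    return df
```

`pd.factorize` assigns codes in order of first appearance, so identical categories always map to the same integer.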

Code
df_nn = integer_col_to_drop(df)
X = df_nn.drop(['usd_price_day'], axis = 1)
y = df_nn['usd_price_day']
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size = 0.3, random_state = seed) 
X_evaluate, X_test, y_evaluate, y_test = train_test_split(X_temp, y_temp, test_size = 0.5, random_state = seed)

n, d = X_train.shape

np.random.seed(seed)
tf.random.set_seed(seed)

# Define our neural network model

model = Sequential()
model.add(Dense(128, input_dim=d, activation='relu'))
model.add(Dense(64, activation='relu'))
model.add(Dense(1, activation='linear'))

# Compile the model
model.compile(loss='mean_squared_error', optimizer=Adam(learning_rate=0.001))

# Fit the model on the training data
model.fit(X_train, y_train, epochs=50, batch_size=32, validation_data=(X_test, y_test))

# Make predictions on the test data
y_pred = model.predict(X_test)

clear_output(wait=True)
# Calculate the mean squared error
mse_nn = mean_squared_error(y_test, y_pred)
rmse_nn = np.sqrt(mse_nn)

print('MSE Neural Network =', mse_nn)
print('--------------------------------')
print('RMSE Neural Network =', rmse_nn)
MSE Neural Network = 2255.9831066786323
--------------------------------
RMSE Neural Network = 47.49719051353072
Code
df_nn = integer_not_col_to_drop(df)
X = df_nn.drop(['usd_price_day'], axis = 1)
y = df_nn['usd_price_day']
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size = 0.3, random_state = seed) 
X_evaluate, X_test, y_evaluate, y_test = train_test_split(X_temp, y_temp, test_size = 0.5, random_state = seed)

n, d = X_train.shape

np.random.seed(seed)
tf.random.set_seed(seed)

# Define our neural network model

model = Sequential()
model.add(Dense(128, input_dim=d, activation='relu'))
model.add(Dense(64, activation='relu'))
model.add(Dense(1, activation='linear'))

# Compile the model
model.compile(loss='mean_squared_error', optimizer=Adam(learning_rate=0.001))

# Fit the model on the training data
model.fit(X_train, y_train, epochs=50, batch_size=32, validation_data=(X_test, y_test))

# Make predictions on the test data
y_pred = model.predict(X_test)

clear_output(wait=True)
# Calculate the mean squared error
mse_nn = mean_squared_error(y_test, y_pred)
rmse_nn = np.sqrt(mse_nn)

print('MSE Neural Network (without dropping columns)  =', mse_nn)
print('--------------------------------')
print('RMSE Neural Network (without dropping columns) =', rmse_nn)
MSE Neural Network (without dropping columns)  = 2296.5996017895936
--------------------------------
RMSE Neural Network (without dropping columns) = 47.92285051819845

Scale Data:

It appears that transforming our categorical values into dummy variables yields better results. Now, let’s assess how our algorithms perform when we scale the data.
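The `scale_data` helper is defined earlier in the project and not reproduced here; a minimal sketch consistent with how it is called below (standardizing only the selected quantitative columns and leaving the dummy columns untouched) might look like:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

def scale_data(df, columns):
    # Standardize the selected quantitative columns to mean 0 and unit
    # variance; dummy/indicator columns are left as they are.
    df = df.copy()
    df[columns] = StandardScaler().fit_transform(df[columns])
    return df
```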

Code
df_nn = dummies_not_col_to_drop(df)
df_nn = scale_data(df_nn, selected_columns_n)
X = df_nn.drop(['usd_price_day'], axis = 1)
y = df_nn['usd_price_day']
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size = 0.3, random_state = seed) 
X_evaluate, X_test, y_evaluate, y_test = train_test_split(X_temp, y_temp, test_size = 0.5, random_state = seed)

n, d = X_train.shape

np.random.seed(seed)
tf.random.set_seed(seed)

# Define our neural network model

model = Sequential()
model.add(Dense(128, input_dim=d, activation='relu'))
model.add(Dense(64, activation='relu'))
model.add(Dense(1, activation='linear'))

# Compile the model
model.compile(loss='mean_squared_error', optimizer=Adam(learning_rate=0.001))

# Fit the model on the training data
model.fit(X_train, y_train, epochs=50, batch_size=32, validation_data=(X_test, y_test))

# Make predictions on the test data
y_pred = model.predict(X_test)

clear_output(wait=True)
# Calculate the mean squared error
mse_nn = mean_squared_error(y_test, y_pred)
rmse_nn = np.sqrt(mse_nn)

print('MSE Neural Network =', mse_nn)
print('--------------------------------')
print('RMSE Neural Network =', rmse_nn)
MSE Neural Network = 2122.1226936304456
--------------------------------
RMSE Neural Network = 46.066502945529145

The data transformation that gives us the best performance on our base model is the one with dummy variables, without dropping any columns, and with unscaled data.
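The `dummies_not_col_to_drop` helper is defined earlier in the project; its core operation is presumably pandas' `get_dummies` with `drop_first=False`, i.e. keeping one indicator column per category ("without dropping columns"). A toy illustration:

```python
import pandas as pd

df_toy = pd.DataFrame({'f_room_type': ['Entire home', 'Private room', 'Entire home'],
                       'n_beds': [2, 1, 3]})
# drop_first=False keeps all indicator columns; drop_first=True would drop
# one level per variable to avoid perfect collinearity in linear models.
dummies = pd.get_dummies(df_toy, columns=['f_room_type'], drop_first=False)
```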

Regularization :

Let’s now try to perform L1 and L2 regularization (lasso and ridge) on our input and hidden layers and see how they perform.
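Concretely, the L1 and L2 penalties add a weight-magnitude term to the training loss. With penalty strength $\alpha$ applied (layer by layer, as Keras does) to the weights $w_j$ of the input and hidden layers:

$$
\mathcal{L}_{\text{L1}} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^2 + \alpha\sum_j \lvert w_j\rvert,
\qquad
\mathcal{L}_{\text{L2}} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^2 + \alpha\sum_j w_j^2 .
$$

Setting $\alpha = 0$ recovers the plain mean-squared-error loss, i.e. no regularization.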

Code
df_lr = dummies_not_col_to_drop(df)

X = df_lr.drop(['usd_price_day'], axis = 1)
y = df_lr['usd_price_day']
X = scale_data(X,selected_columns_n)
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size = 0.3, random_state = seed) 
X_evaluate, X_test, y_evaluate, y_test = train_test_split(X_temp, y_temp, test_size = 0.5, random_state = seed)
n, d = X_train.shape

# Set a random seed
np.random.seed(seed)
tf.random.set_seed(seed)

alpha_to_try = [0, 0.001, 0.01, 0.1, 1]

# Define our neural network model
best_mse = []
best_alpha = None  # Initialize the variable to store the best alpha
best_rmse = None   # Initialize the variable to store the best RMSE

for alpha in alpha_to_try:
    model = Sequential()
    model.add(Dense(128, input_dim=d, activation='relu', kernel_regularizer=l1(alpha)))
    model.add(Dense(64, activation='relu', kernel_regularizer=l1(alpha)))
    model.add(Dense(1, activation='linear'))

    # Compile the model
    model.compile(loss='mean_squared_error', optimizer=Adam(learning_rate=0.001))

    # Fit the model on the training data
    model.fit(X_train, y_train, epochs=50, batch_size=32, validation_data=(X_test, y_test))

    # Make predictions on the test data
    y_pred = model.predict(X_test)

    # Calculate the mean squared error
    mse_nn = mean_squared_error(y_test, y_pred)

    # Check if this is the best result so far (compare before appending,
    # otherwise mse_nn < min(best_mse) could never be true)
    if best_alpha is None or mse_nn < min(best_mse):
        best_alpha = alpha
        best_rmse = np.sqrt(mse_nn)
    best_mse.append(mse_nn)

    # Clear the output because it takes up too much space
    clear_output(wait=True)
    print('Trying alpha =', alpha)

clear_output(wait=True)
best_alphaL1 = best_alpha
mse_nn_L1 = np.min(best_mse)

print('L1 :')
print('--------------------------------')
print('MSE Neural Network =', mse_nn_L1)
print('Best Alpha =', best_alphaL1)
print('Best RMSE =', best_rmse)
print('--------------------------------')
L1 :
--------------------------------
MSE Neural Network = 1961.1414322403084
Best Alpha = 0
Best RMSE = 47.191754396082494
--------------------------------
Code
df_lr = dummies_not_col_to_drop(df)

X = df_lr.drop(['usd_price_day'], axis = 1)
y = df_lr['usd_price_day']
X = scale_data(X, selected_columns_n)
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size = 0.3, random_state = seed) 
X_evaluate, X_test, y_evaluate, y_test = train_test_split(X_temp, y_temp, test_size = 0.5, random_state = seed)
n, d = X_train.shape

# Set a random seed
np.random.seed(seed)
tf.random.set_seed(seed)

alpha_to_try = [0, 0.001, 0.01, 0.1, 1]

# Define your neural network model
best_mse = []
best_alpha = None  # Initialize the variable to store the best alpha
best_rmse = None   # Initialize the variable to store the best RMSE

for alpha in alpha_to_try:
    model = Sequential()
    model.add(Dense(128, input_dim=d, activation='relu', kernel_regularizer=l2(alpha)))
    model.add(Dense(64, activation='relu', kernel_regularizer=l2(alpha)))
    model.add(Dense(1, activation='linear'))

    # Compile the model
    model.compile(loss='mean_squared_error', optimizer=Adam(learning_rate=0.001))

    # Fit the model on the training data
    model.fit(X_train, y_train, epochs=50, batch_size=32, validation_data=(X_test, y_test))

    # Make predictions on the test data
    y_pred = model.predict(X_test)

    # Calculate the mean squared error
    mse_nn = mean_squared_error(y_test, y_pred)

    # Check if this is the best result so far (compare before appending,
    # otherwise mse_nn < min(best_mse) could never be true)
    if best_alpha is None or mse_nn < min(best_mse):
        best_alpha = alpha
        best_rmse = np.sqrt(mse_nn)
    best_mse.append(mse_nn)

    # Clear the output because it takes up too much space
    clear_output(wait=True)
    print('Trying alpha =', alpha)

clear_output(wait=True)
best_alphaL2 = best_alpha
mse_nn_L2 = np.min(best_mse)

print('L2 :')
print('--------------------------------')
print('MSE Neural Network =', mse_nn_L2)
print('Best Alpha =', best_alphaL2)
print('Best RMSE =', best_rmse)
print('--------------------------------')
L2 :
--------------------------------
MSE Neural Network = 1906.966932673867
Best Alpha = 0
Best RMSE = 45.96420824352791
--------------------------------

Both searches select alpha = 0, i.e. no regularization at all (the L2 run reaches the slightly lower MSE). We therefore keep the unregularized model as the basis for tuning the parameters of our neural networks.

Tuning Parameters of Our Neural Network:

In the initial phase of parameter tuning, we selected the model that performed the best without any regularization. Next, we fine-tuned the regularization term before making adjustments to the parameters of our neural network. We also explored the application of both L1 and L2 regularization techniques.

The training data for our model was chosen previously: scaled data, no low-contribution columns removed, and all qualitative variables converted into dummy variables. The selected regularization strength is alpha = 0, which means we add no regularization at all.
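As a side note, training for a fixed 50 epochs can waste time or overfit; the `EarlyStopping` callback imported at the top of the notebook (but unused here) stops training once the validation loss plateaus. Its core logic, sketched in plain Python with a hypothetical function name:

```python
def early_stopping_epoch(val_losses, patience=5):
    # Return (stop_epoch, best_epoch): training stops once the validation
    # loss has failed to improve for `patience` consecutive epochs.
    best, best_epoch, wait = float('inf'), 0, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch
```

In Keras this corresponds to `EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)` passed via the `callbacks` argument of `model.fit`.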

Code
if mse_nn_L1 > mse_nn_L2 :
    best_alpha = best_alphaL2
else:
    best_alpha = best_alphaL1


# Function to build the Neural Networks
def build_model(hp):
    model = Sequential()
    model.add(Dense(units=hp.Int('units_1', min_value=64, max_value=256, step=1),
                input_dim=d, activation=hp.Choice('activation_1', values=['relu', 'tanh']),kernel_regularizer=l2(best_alpha)))
    model.add(Dense(units=hp.Int('units_2', min_value=32, max_value=128, step=1),
                activation=hp.Choice('activation_2', values=['relu', 'tanh']),kernel_regularizer=l2(best_alpha)))
    model.add(Dense(1, activation='linear'))
    
    optimizer = Adam(learning_rate=hp.Choice('learning_rate', values=[0.001, 0.01, 0.1, 1.0]))
    model.compile(loss='mean_squared_error', optimizer=optimizer)
    
    return model
              
              
start_time = time.time()

# Initialize the RandomSearch tuner
tuner = RandomSearch(
    build_model,
    objective='val_loss',
    max_trials=10,  # Adjust the number of trials
    directory='my_tuner_directory'  # Create a directory to store tuner results
)

# Perform hyperparameter search
tuner.search(X_train, y_train, epochs=50, validation_data=(X_test, y_test))

# Get the best hyperparameters
best_hps = tuner.get_best_hyperparameters(num_trials=1)[0]

# Build and compile the best model
best_model = build_model(best_hps)

# Fit the best model on the training data
best_model.fit(X_train, y_train, epochs=50, validation_data=(X_test, y_test))

# Make predictions on the test data
y_pred = best_model.predict(X_test)

# Calculate the mean squared error
mse_nn = mean_squared_error(y_test, y_pred)
rmse_nn = np.sqrt(mse_nn)

end_time = time.time()
execution_time = end_time - start_time

# Clear the output because it takes up too much space
clear_output(wait=True)
print('MSE Neural Network =', mse_nn)
print('--------------------------------')
print('RMSE Neural Network =', rmse_nn)
print('--------------------------------')
print('Execution Time:', execution_time, 'seconds')

# Output the chosen hyperparameters
print('Chosen Hyperparameters:')
print('units_1:', best_hps.get('units_1'))
print('activation_1:', best_hps.get('activation_1'))
print('units_2:', best_hps.get('units_2'))
print('activation_2:', best_hps.get('activation_2'))
print('learning_rate:', best_hps.get('learning_rate'))

# Delete the 'my_tuner_directory' folder where the tuner stored its
# hyperparameter-search results
import shutil

# Path to the tuner directory
tuner_directory = 'my_tuner_directory'

# Remove the tuner directory and its contents
shutil.rmtree(tuner_directory)
print('my_tuner_directory deleted')
MSE Neural Network = 2054.0461389188135
--------------------------------
RMSE Neural Network = 45.32158579439618
--------------------------------
Execution Time: 290.14751076698303 seconds
Chosen Hyperparameters:
units_1: 96
activation_1: relu
units_2: 62
activation_2: relu
learning_rate: 0.001
my_tuner_directory deleted

Evaluation :

It appears that our default model outperforms the one obtained through parameter tuning (MSE ≈ 1907 for the best untuned run vs. ≈ 2054 after tuning).

Code
if mse_nn_L1 > mse_nn_L2 :
    best_alpha = best_alphaL2
else:
    best_alpha = best_alphaL1

df_lr = dummies_not_col_to_drop(df)

X = df_lr.drop(['usd_price_day'], axis = 1)
y = df_lr['usd_price_day']
X = scale_data(X,selected_columns_n)
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size = 0.3, random_state = seed) 
X_evaluate, X_test, y_evaluate, y_test = train_test_split(X_temp, y_temp, test_size = 0.5, random_state = seed)
n, d = X_train.shape

np.random.seed(seed)
tf.random.set_seed(seed)

# Define your neural network model
model = Sequential()
model.add(Dense(128, input_dim=d, activation='relu', kernel_regularizer=l1(best_alpha)))
model.add(Dense(64, activation='relu', kernel_regularizer=l1(best_alpha)))
model.add(Dense(1, activation='linear'))

# Compile the model
model.compile(loss='mean_squared_error', optimizer=Adam(learning_rate=0.001))

# Fit the model on the training data
model.fit(X_train, y_train, epochs=50, batch_size=32, validation_data=(X_test, y_test))

# Make predictions on the test data
y_pred = model.predict(X_evaluate)

# Calculate the mean squared error
mse_nn = mean_squared_error(y_evaluate, y_pred)
rmse_nn = np.sqrt(mse_nn)

clear_output(wait=True)
print('MSE Neural Network (without dropping columns) =', mse_nn)
print('--------------------------------')
print('RMSE Neural Network (without dropping columns) =', rmse_nn)
MSE Neural Network (without dropping columns) = 2413.8663659507274
--------------------------------
RMSE Neural Network (without dropping columns) = 49.131114031240195

1.3) Conclusion :

Code
table_name('MSE of each chosen model on the evaluation set')
print(f"+-----------------------+--------------------------------+")
print(f"|       Model           |   Mean Squared Error (MSE)     |")
print(f"+-----------------------+--------------------------------+")
print(f"| Linear Regression     |  MSE Linear Regression : {round(mse_lr,1)}|")
print(f"| Random Forest         |  MSE Random Forest :     {round(mse_rf,1)}|")
print(f"| Neural Networks       |  MSE Neural Networks :   {round(mse_nn,1)}|")
print(f"+-----------------------+--------------------------------+")

Table 6 : MSE of each chosen model on the evaluation set

+-----------------------+--------------------------------+
|       Model           |   Mean Squared Error (MSE)     |
+-----------------------+--------------------------------+
| Linear Regression     |  MSE Linear Regression : 2703.6|
| Random Forest         |  MSE Random Forest :     2159.3|
| Neural Networks       |  MSE Neural Networks :   2413.9|
+-----------------------+--------------------------------+

The Random Forest algorithm is the model that performs best on the evaluation set.
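To make these errors interpretable on the price scale, the MSEs from Table 6 convert to RMSEs in USD per day:

```python
import numpy as np

# Evaluation-set MSEs from Table 6
mse = {'Linear Regression': 2703.6, 'Random Forest': 2159.3, 'Neural Networks': 2413.9}
rmse = {name: float(np.sqrt(v)) for name, v in mse.items()}
# Random Forest has the lowest error, roughly 46.5 USD per day
```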

Code
end_time_project = time.time()

execution_time = end_time_project - start_time_project
minutes = int(execution_time // 60)
seconds = int(execution_time % 60)

print(f"Execution time of the total project is: {minutes} minutes and {seconds} seconds")
Execution time of the total project is: 227 minutes and 59 seconds

2) Predicting firm exit : probability and classification

This case study aims to predict corporate exits, i.e. companies going out of business. Such predictions are crucial for various business decisions, such as supplier selection, loan approvals, and office-space leasing. The study uses the Bisnode-firms dataset, focusing on small and medium-sized enterprises (SMEs) with annual sales below 10 million euros, observed in 2012. The goal is to build predictive models for estimating the probability of a firm's exit from business, and we will compare our models in terms of AUC.
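AUC (area under the ROC curve) measures how well predicted probabilities rank exiting firms above surviving ones: it equals the probability that a randomly chosen positive receives a higher score than a randomly chosen negative, with 0.5 meaning random ranking and 1.0 perfect separation. A toy illustration with hypothetical labels and scores:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true  = np.array([0, 0, 1, 1])           # hypothetical exit labels
y_score = np.array([0.1, 0.4, 0.35, 0.8])  # hypothetical predicted probabilities
auc = roc_auc_score(y_true, y_score)
# 3 of the 4 positive/negative pairs are ranked correctly -> AUC = 0.75
```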

Code
####################################################
#################### Bisnode exit ####################
####################################################
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import random
import time
import itertools
import sklearn
import missingno as msno


from sklearn.model_selection import train_test_split,cross_val_score, GridSearchCV,RandomizedSearchCV, KFold
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier, AdaBoostClassifier
from sklearn.metrics import mean_squared_error, r2_score, roc_auc_score
from sklearn.linear_model import Ridge, RidgeCV, Lasso, LassoCV
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.feature_selection import SelectKBest, chi2, f_classif
from sklearn.tree import DecisionTreeClassifier
from sklearn import model_selection
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.impute import SimpleImputer
from tensorflow import keras


from deap import base, creator, tools, algorithms

2.1) Data visualization

In this section we will visualize our dataframe ‘bisnode_firm_homework.csv’.

Code
df_2 = pd.read_csv("bisnode_firm_homework.csv")
Code
df_2.head()
year comp_id begin end amort curr_assets curr_liab extra_exp extra_inc extra_profit_loss ... origin nace_main ind2 ind urban_m region_m founded_date exit_date labor_avg default
0 2012 1001541 2012-01-01 2012-12-31 481.481476 9629.629883 1303.703735 0.0 0.000000 0.000000 ... Domestic 5610.0 56.0 3.0 3 Central 2008-02-24 NaN NaN 0
1 2012 1002029 2012-01-01 2012-12-31 14929.629883 203885.187500 120444.453125 0.0 0.000000 0.000000 ... Domestic 2711.0 27.0 2.0 3 East 2006-07-03 NaN 0.458333 0
2 2012 1003200 2012-01-01 2012-12-31 25.925926 22.222221 10996.295898 0.0 0.000000 0.000000 ... Domestic 5630.0 56.0 3.0 1 Central 2003-10-21 2014-08-09 NaN 1
3 2012 1011889 2012-01-01 2012-12-31 36625.925781 160166.671875 18911.111328 0.0 0.000000 0.000000 ... Domestic 5510.0 55.0 3.0 2 West 1992-11-09 NaN 1.621212 0
4 2012 1014183 2012-01-01 2012-12-31 12551.851562 199903.703125 8274.074219 0.0 7.407407 7.407407 ... Domestic 5510.0 55.0 3.0 2 Central 2001-12-21 NaN 0.715278 0

5 rows × 43 columns

Code
print(f"Number of row : {df_2.shape[0]}\nNumber of column : {df_2.shape[1]}")
Number of row : 21723
Number of column : 43

We have 42 input variables. Our output variable is default, a binary variable equal to 1 if the firm exited within 2 years and 0 otherwise.

Code
df_2.columns.values
array(['year', 'comp_id', 'begin', 'end', 'amort', 'curr_assets',
       'curr_liab', 'extra_exp', 'extra_inc', 'extra_profit_loss',
       'fixed_assets', 'inc_bef_tax', 'intang_assets', 'inventories',
       'liq_assets', 'material_exp', 'personnel_exp', 'profit_loss_year',
       'sales', 'share_eq', 'subscribed_cap', 'tang_assets',
       'balsheet_flag', 'balsheet_length', 'balsheet_notfullyear',
       'founded_year', 'exit_year', 'ceo_count', 'foreign', 'female',
       'birth_year', 'inoffice_days', 'gender', 'origin', 'nace_main',
       'ind2', 'ind', 'urban_m', 'region_m', 'founded_date', 'exit_date',
       'labor_avg', 'default'], dtype=object)

Description of the input variables :

- year : The year in which the data was recorded.
- comp_id : The unique identifier for each company.
- begin : The start date of the financial year.
- end : The end date of the financial year.
- amort : Amortization.
- curr_assets : Current assets.
- curr_liab : Current liabilities.
- extra_exp : Extraordinary expenses.
- extra_inc : Extraordinary income.
- extra_profit_loss : Extraordinary profit or loss.
- fixed_assets : Fixed assets.
- inc_bef_tax : Income before tax.
- intang_assets : Intangible assets.
- inventories : Inventories.
- liq_assets : Liquid assets.
- material_exp : Material expenses.
- personnel_exp : Personnel expenses.
- profit_loss_year : Profit or loss for the year.
- sales : Sales or revenues.
- share_eq : Shareholder equity.
- subscribed_cap : Subscribed capital.
- tang_assets : Tangible assets.
- balsheet_flag : Balance sheet flag.
- balsheet_length : Length of the balance sheet.
- balsheet_notfullyear : Indicates if the balance sheet is not for a full year.
- founded_year : The year in which the company was founded.
- exit_year : The year in which the company exited the market (if applicable).
- ceo_count : The number of CEOs.
- foreign : Indicates if the company is foreign or domestic.
- female : Indicates if the CEO is female.
- birth_year : The birth year of the CEO.
- inoffice_days : The number of days the CEO has been in office.
- gender : The gender of the CEO.
- origin : The origin of the CEO.
- nace_main : The main industry classification code (NACE) of the company.
- ind2 : The second-level industry classification code.
- ind : The third-level industry classification code.
- urban_m : The urbanization level of the company's location.
- region_m : The region in which the company is located.
- founded_date : The date on which the company was founded.
- exit_date : The date on which the company exited the market (if applicable).
- labor_avg : The average number of employees.

Code
df_2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21723 entries, 0 to 21722
Data columns (total 43 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   year                  21723 non-null  int64  
 1   comp_id               21723 non-null  int64  
 2   begin                 21723 non-null  object 
 3   end                   21723 non-null  object 
 4   amort                 21683 non-null  float64
 5   curr_assets           21713 non-null  float64
 6   curr_liab             21713 non-null  float64
 7   extra_exp             21723 non-null  float64
 8   extra_inc             21723 non-null  float64
 9   extra_profit_loss     21723 non-null  float64
 10  fixed_assets          21713 non-null  float64
 11  inc_bef_tax           21723 non-null  float64
 12  intang_assets         21713 non-null  float64
 13  inventories           21713 non-null  float64
 14  liq_assets            21713 non-null  float64
 15  material_exp          21683 non-null  float64
 16  personnel_exp         21683 non-null  float64
 17  profit_loss_year      21713 non-null  float64
 18  sales                 21723 non-null  float64
 19  share_eq              21713 non-null  float64
 20  subscribed_cap        21713 non-null  float64
 21  tang_assets           21713 non-null  float64
 22  balsheet_flag         21723 non-null  int64  
 23  balsheet_length       21723 non-null  int64  
 24  balsheet_notfullyear  21723 non-null  int64  
 25  founded_year          19713 non-null  float64
 26  exit_year             2130 non-null   float64
 27  ceo_count             19715 non-null  float64
 28  foreign               19715 non-null  float64
 29  female                19715 non-null  float64
 30  birth_year            16845 non-null  float64
 31  inoffice_days         19715 non-null  float64
 32  gender                19715 non-null  object 
 33  origin                19715 non-null  object 
 34  nace_main             21718 non-null  float64
 35  ind2                  21718 non-null  float64
 36  ind                   21067 non-null  float64
 37  urban_m               21723 non-null  int64  
 38  region_m              21664 non-null  object 
 39  founded_date          21720 non-null  object 
 40  exit_date             2375 non-null   object 
 41  labor_avg             18569 non-null  float64
 42  default               21723 non-null  int64  
dtypes: float64(29), int64(7), object(7)
memory usage: 7.1+ MB
Code
df_2.describe()
year comp_id amort curr_assets curr_liab extra_exp extra_inc extra_profit_loss fixed_assets inc_bef_tax ... foreign female birth_year inoffice_days nace_main ind2 ind urban_m labor_avg default
count 21723.0 2.172300e+04 2.168300e+04 2.171300e+04 2.171300e+04 2.172300e+04 2.172300e+04 2.172300e+04 2.171300e+04 2.172300e+04 ... 19715.000000 19715.000000 16845.000000 19715.000000 21718.000000 21718.000000 21067.000000 21723.000000 18569.000000 21723.000000
mean 2012.0 1.536114e+11 9.054888e+03 9.989281e+04 8.724403e+04 1.222984e+03 2.395827e+03 1.169621e+03 1.488103e+05 5.789926e+03 ... 0.095006 0.259183 1965.548803 3022.327288 4848.366240 48.257805 2.689467 2.081020 0.621691 0.205681
std 0.0 1.381009e+11 4.921305e+04 4.610725e+05 5.110353e+05 1.158699e+05 1.201532e+05 3.092631e+04 1.159423e+06 1.267507e+05 ... 0.280331 0.403386 11.340306 1725.458205 1248.426638 12.554996 0.506977 0.845527 1.586800 0.404207
min 2012.0 1.001541e+06 -1.489630e+04 -3.460000e+04 -8.759259e+03 -1.418519e+03 -1.740741e+02 -2.888889e+05 0.000000e+00 -1.091879e+07 ... 0.000000 0.000000 1920.000000 10.000000 111.000000 1.000000 1.000000 1.000000 0.083333 0.000000
25% 2012.0 2.888468e+10 1.148148e+02 3.251852e+03 3.566667e+03 0.000000e+00 0.000000e+00 0.000000e+00 6.666666e+01 -6.344444e+03 ... 0.000000 0.000000 1957.000000 1859.000000 3314.000000 33.000000 2.000000 1.000000 0.097222 0.000000
50% 2012.0 1.143027e+11 8.555555e+02 1.120741e+04 1.411111e+04 0.000000e+00 0.000000e+00 0.000000e+00 4.303704e+03 2.222222e+02 ... 0.000000 0.000000 1967.000000 2598.000000 5610.000000 56.000000 3.000000 2.000000 0.229167 0.000000
75% 2012.0 2.580121e+11 3.890741e+03 3.896296e+04 4.662593e+04 0.000000e+00 0.000000e+00 0.000000e+00 3.718889e+04 3.925926e+03 ... 0.000000 0.500000 1974.000000 3634.000000 5610.000000 56.000000 3.000000 3.000000 0.513889 0.000000
max 2012.0 4.641050e+11 3.570863e+06 2.066350e+07 4.186429e+07 1.704172e+07 1.709232e+07 3.253704e+06 1.036673e+08 4.378226e+06 ... 1.000000 1.000000 2015.000000 10041.000000 9609.000000 96.000000 3.000000 3.000000 42.118057 1.000000

8 rows × 36 columns

Code
df_2.isna().sum()
year                0
comp_id             0
begin               0
end                 0
amort              40
                ...  
region_m           59
founded_date        3
exit_date       19348
labor_avg        3154
default             0
Length: 43, dtype: int64

Visualize missing values :

Code
graph_name("NA's view")
msno.matrix(df_2);

Graph 9 : NA’s view

Code
graph_name("NA's proportion 1")
msno.bar(df_2);

Graph 10 : NA’s proportion 1

Correlation between the input variables and the output variable :

Code
graph_name("Correlation")
df_2.drop(['default'], axis = 1).corrwith(df_2['default'], numeric_only=True).plot.bar(figsize = (20, 10),
                                                                    fontsize = 20, rot = 90, grid = True,
                                                                    color = color_theme);

Graph 11 : Correlation


Correlation between input variables :

Code
df_2.drop(['year','default'], axis = 1).corr(method='pearson', numeric_only=True).style.format("{:.2}").background_gradient(cmap=plt.get_cmap('coolwarm'), axis=1)
  comp_id amort curr_assets curr_liab extra_exp extra_inc extra_profit_loss fixed_assets inc_bef_tax intang_assets inventories liq_assets material_exp personnel_exp profit_loss_year sales share_eq subscribed_cap tang_assets balsheet_flag balsheet_length balsheet_notfullyear founded_year exit_year ceo_count foreign female birth_year inoffice_days nace_main ind2 ind urban_m labor_avg
comp_id 1.0 0.00095 0.003 0.0028 -0.0071 -0.0059 0.0038 0.0014 0.00069 0.0085 0.0051 0.0014 0.00022 -0.0024 0.00079 -0.00062 0.0079 0.011 -0.00086 0.0036 -0.006 0.0068 4.6e-05 0.013 -0.0016 -0.007 0.0035 0.0095 -0.002 -0.0069 -0.0068 -0.0071 -0.0059 -0.0096
amort 0.00095 1.0 0.48 0.4 0.037 0.06 0.093 0.75 0.12 0.4 0.35 0.32 0.52 0.54 0.07 0.59 0.49 0.3 0.77 0.073 0.02 -0.026 -0.11 -0.039 0.099 0.17 -0.065 -0.084 0.018 -0.11 -0.11 -0.12 -0.0092 0.5
curr_assets 0.003 0.48 1.0 0.67 0.015 0.048 0.13 0.29 0.22 0.31 0.74 0.61 0.75 0.6 0.11 0.74 0.57 0.28 0.26 0.13 0.011 -0.019 -0.15 -0.074 0.13 0.21 -0.092 -0.093 0.022 -0.21 -0.21 -0.23 -0.018 0.5
curr_liab 0.0028 0.4 0.67 1.0 0.21 0.22 0.07 0.32 -0.19 0.21 0.53 0.24 0.52 0.39 -0.28 0.46 0.2 0.16 0.29 0.075 0.012 -0.016 -0.091 -0.021 0.095 0.18 -0.07 -0.07 -0.0079 -0.1 -0.1 -0.13 -0.026 0.34
extra_exp -0.0071 0.037 0.015 0.21 1.0 0.97 0.0076 0.11 0.0045 0.0054 0.0035 0.0073 0.013 0.0094 0.0035 0.014 0.036 0.026 0.0087 0.00023 0.0018 -0.0022 -0.044 -0.014 0.013 0.039 -0.024 -0.025 -0.00018 0.0025 0.0025 0.0025 -0.0086 0.0095
extra_inc -0.0059 0.06 0.048 0.22 0.97 1.0 0.26 0.13 0.06 0.036 0.014 0.023 0.027 0.024 0.061 0.03 0.065 0.048 0.024 1.3e-05 0.0032 -0.003 -0.029 -0.076 0.025 0.034 -0.016 -0.012 0.0011 -0.00065 -0.00059 -0.00061 -0.0096 0.029
extra_profit_loss 0.0038 0.093 0.13 0.07 0.0076 0.26 1.0 0.082 0.22 0.12 0.041 0.063 0.058 0.06 0.22 0.062 0.12 0.09 0.062 -0.00082 0.0054 -0.0033 -0.019 -0.075 0.022 0.025 -0.011 -0.0073 0.0012 -0.012 -0.012 -0.012 -0.0048 0.089
fixed_assets 0.0014 0.75 0.29 0.32 0.11 0.13 0.082 1.0 0.12 0.15 0.2 0.24 0.29 0.3 0.11 0.35 0.6 0.38 0.96 0.058 0.016 -0.019 -0.084 -0.016 0.055 0.14 -0.046 -0.08 0.0096 -0.026 -0.026 -0.03 -0.0053 0.32
inc_bef_tax 0.00069 0.12 0.22 -0.19 0.0045 0.06 0.22 0.12 1.0 0.016 -0.0083 0.33 0.16 0.15 0.94 0.3 0.39 -0.02 0.11 0.013 0.016 -0.019 -0.078 -0.029 0.02 0.014 -0.027 -0.053 0.053 -0.099 -0.1 -0.093 0.0099 0.14
intang_assets 0.0085 0.4 0.31 0.21 0.0054 0.036 0.12 0.15 0.016 1.0 0.28 0.12 0.26 0.28 -0.00042 0.27 0.24 0.15 0.096 0.021 -0.00079 -0.0051 -0.046 -0.057 0.048 0.036 -0.032 -0.03 -0.0072 -0.069 -0.069 -0.071 -0.022 0.21
inventories 0.0051 0.35 0.74 0.53 0.0035 0.014 0.041 0.2 -0.0083 0.28 1.0 0.23 0.58 0.43 -0.079 0.52 0.38 0.16 0.18 0.079 0.0067 -0.013 -0.097 -0.062 0.064 0.13 -0.055 -0.053 0.0095 -0.14 -0.14 -0.17 -0.0077 0.36
liq_assets 0.0014 0.32 0.61 0.24 0.0073 0.023 0.063 0.24 0.33 0.12 0.23 1.0 0.43 0.41 0.23 0.5 0.47 0.17 0.23 0.11 0.0045 -0.011 -0.12 -0.04 0.069 0.13 -0.061 -0.084 0.039 -0.14 -0.14 -0.14 -0.013 0.37
material_exp 0.00022 0.52 0.75 0.52 0.013 0.027 0.058 0.29 0.16 0.26 0.58 0.43 1.0 0.67 0.048 0.96 0.45 0.21 0.27 0.14 0.026 -0.034 -0.14 -0.0011 0.14 0.21 -0.099 -0.07 0.016 -0.19 -0.19 -0.2 -0.032 0.63
personnel_exp -0.0024 0.54 0.6 0.39 0.0094 0.024 0.06 0.3 0.15 0.28 0.43 0.41 0.67 1.0 0.062 0.81 0.45 0.21 0.29 0.14 0.019 -0.025 -0.15 -0.037 0.13 0.21 -0.085 -0.1 0.014 -0.18 -0.18 -0.19 -0.01 0.87
profit_loss_year 0.00079 0.07 0.11 -0.28 0.0035 0.061 0.22 0.11 0.94 -0.00042 -0.079 0.23 0.048 0.062 1.0 0.18 0.35 -0.036 0.096 -0.0038 0.012 -0.015 -0.052 -0.026 0.0033 -0.011 -0.011 -0.037 0.045 -0.058 -0.058 -0.051 0.013 0.08
sales -0.00062 0.59 0.74 0.46 0.014 0.03 0.062 0.35 0.3 0.27 0.52 0.5 0.96 0.81 0.18 1.0 0.52 0.22 0.33 0.15 0.027 -0.035 -0.16 -0.026 0.15 0.23 -0.1 -0.089 0.025 -0.21 -0.21 -0.22 -0.024 0.74
share_eq 0.0079 0.49 0.57 0.2 0.036 0.065 0.12 0.6 0.39 0.24 0.38 0.47 0.45 0.45 0.35 0.52 1.0 0.56 0.48 0.1 0.014 -0.018 -0.17 -0.069 0.1 0.16 -0.068 -0.11 0.042 -0.12 -0.12 -0.12 0.0054 0.43
subscribed_cap 0.011 0.3 0.28 0.16 0.026 0.048 0.09 0.38 -0.02 0.15 0.16 0.17 0.21 0.21 -0.036 0.22 0.56 1.0 0.28 0.051 0.0097 -0.0095 -0.074 -0.031 0.055 0.16 -0.043 -0.039 -0.013 -0.023 -0.022 -0.02 -0.0019 0.24
tang_assets -0.00086 0.77 0.26 0.29 0.0087 0.024 0.062 0.96 0.11 0.096 0.18 0.23 0.27 0.29 0.096 0.33 0.48 0.28 1.0 0.034 0.018 -0.022 -0.077 -0.012 0.051 0.13 -0.042 -0.078 0.012 -0.024 -0.023 -0.027 -0.00081 0.31
balsheet_flag 0.0036 0.073 0.13 0.075 0.00023 1.3e-05 -0.00082 0.058 0.013 0.021 0.079 0.11 0.14 0.14 -0.0038 0.15 0.1 0.051 0.034 1.0 -0.17 0.17 -0.026 -0.068 0.044 0.12 -0.027 0.0093 -0.012 -0.048 -0.048 -0.046 -0.016 0.12
balsheet_length -0.006 0.02 0.011 0.012 0.0018 0.0032 0.0054 0.016 0.016 -0.00079 0.0067 0.0045 0.026 0.019 0.012 0.027 0.014 0.0097 0.018 -0.17 1.0 -0.83 -0.18 0.1 0.077 -0.027 -0.028 -0.091 0.15 -0.05 -0.05 -0.048 -0.002 -0.00064
balsheet_notfullyear 0.0068 -0.026 -0.019 -0.016 -0.0022 -0.003 -0.0033 -0.019 -0.019 -0.0051 -0.013 -0.011 -0.034 -0.025 -0.015 -0.035 -0.018 -0.0095 -0.022 0.17 -0.83 1.0 0.23 -0.14 -0.094 0.031 0.037 0.12 -0.19 0.067 0.067 0.065 0.0064 -0.011
founded_year 4.6e-05 -0.11 -0.15 -0.091 -0.044 -0.029 -0.019 -0.084 -0.078 -0.046 -0.097 -0.12 -0.14 -0.15 -0.052 -0.16 -0.17 -0.074 -0.077 -0.026 -0.18 0.23 1.0 0.18 -0.077 -0.028 0.063 0.43 -0.58 0.24 0.24 0.23 0.033 -0.14
exit_year 0.013 -0.039 -0.074 -0.021 -0.014 -0.076 -0.075 -0.016 -0.029 -0.057 -0.062 -0.04 -0.0011 -0.037 -0.026 -0.026 -0.069 -0.031 -0.012 -0.068 0.1 -0.14 0.18 1.0 0.0094 -0.021 0.0073 0.065 -0.092 0.068 0.069 0.075 -0.037 -0.021
ceo_count -0.0016 0.099 0.13 0.095 0.013 0.025 0.022 0.055 0.02 0.048 0.064 0.069 0.14 0.13 0.0033 0.15 0.1 0.055 0.051 0.044 0.077 -0.094 -0.077 0.0094 1.0 0.084 0.012 -0.0025 -0.017 -0.031 -0.031 -0.03 -0.049 0.13
foreign -0.007 0.17 0.21 0.18 0.039 0.034 0.025 0.14 0.014 0.036 0.13 0.13 0.21 0.21 -0.011 0.23 0.16 0.16 0.13 0.12 -0.027 0.031 -0.028 -0.021 0.084 1.0 -0.19 -0.074 -0.098 -0.039 -0.04 -0.046 -0.026 0.22
female 0.0035 -0.065 -0.092 -0.07 -0.024 -0.016 -0.011 -0.046 -0.027 -0.032 -0.055 -0.061 -0.099 -0.085 -0.011 -0.1 -0.068 -0.043 -0.042 -0.027 -0.028 0.037 0.063 0.0073 0.012 -0.19 1.0 0.0099 -0.064 0.19 0.19 0.19 0.058 -0.093
birth_year 0.0095 -0.084 -0.093 -0.07 -0.025 -0.012 -0.0073 -0.08 -0.053 -0.03 -0.053 -0.084 -0.07 -0.1 -0.037 -0.089 -0.11 -0.039 -0.078 0.0093 -0.091 0.12 0.43 0.065 -0.0025 -0.074 0.0099 1.0 -0.34 0.2 0.2 0.2 0.035 -0.1
inoffice_days -0.002 0.018 0.022 -0.0079 -0.00018 0.0011 0.0012 0.0096 0.053 -0.0072 0.0095 0.039 0.016 0.014 0.045 0.025 0.042 -0.013 0.012 -0.012 0.15 -0.19 -0.58 -0.092 -0.017 -0.098 -0.064 -0.34 1.0 -0.21 -0.21 -0.2 0.023 0.019
nace_main -0.0069 -0.11 -0.21 -0.1 0.0025 -0.00065 -0.012 -0.026 -0.099 -0.069 -0.14 -0.14 -0.19 -0.18 -0.058 -0.21 -0.12 -0.023 -0.024 -0.048 -0.05 0.067 0.24 0.068 -0.031 -0.039 0.19 0.2 -0.21 1.0 1.0 0.95 0.022 -0.14
ind2 -0.0068 -0.11 -0.21 -0.1 0.0025 -0.00059 -0.012 -0.026 -0.1 -0.069 -0.14 -0.14 -0.19 -0.18 -0.058 -0.21 -0.12 -0.022 -0.023 -0.048 -0.05 0.067 0.24 0.069 -0.031 -0.04 0.19 0.2 -0.21 1.0 1.0 0.95 0.021 -0.14
ind -0.0071 -0.12 -0.23 -0.13 0.0025 -0.00061 -0.012 -0.03 -0.093 -0.071 -0.17 -0.14 -0.2 -0.19 -0.051 -0.22 -0.12 -0.02 -0.027 -0.046 -0.048 0.065 0.23 0.075 -0.03 -0.046 0.19 0.2 -0.2 0.95 0.95 1.0 0.02 -0.16
urban_m -0.0059 -0.0092 -0.018 -0.026 -0.0086 -0.0096 -0.0048 -0.0053 0.0099 -0.022 -0.0077 -0.013 -0.032 -0.01 0.013 -0.024 0.0054 -0.0019 -0.00081 -0.016 -0.002 0.0064 0.033 -0.037 -0.049 -0.026 0.058 0.035 0.023 0.022 0.021 0.02 1.0 -0.0087
labor_avg -0.0096 0.5 0.5 0.34 0.0095 0.029 0.089 0.32 0.14 0.21 0.36 0.37 0.63 0.87 0.08 0.74 0.43 0.24 0.31 0.12 -0.00064 -0.011 -0.14 -0.021 0.13 0.22 -0.093 -0.1 0.019 -0.14 -0.14 -0.16 -0.0087 1.0
Code
graph_name("Barplot Quantitative Variables")
df_2.hist(color = color_theme, figsize = (30,30), grid = False, bins = 5);

Graph 12 : Barplot Quantitative Variables

2.2) Data cleaning

We will build three different dataframes, each with its own imputation or cleaning strategy.

Code
df_21 = df_2.copy()
df_22 = df_2.copy()
df_23 = df_2.copy()

For our first two dataframes we drop the identifier and date columns ('year', 'comp_id', 'begin', 'end') along with variables we do not need: 'female' because we already have 'gender', 'founded_date' because we already have 'founded_year', and 'birth_year' because it is the column with the most NAs. For all three dataframes we also drop 'exit_year' and 'exit_date', which contain too many missing values.

Code
df_21 = df_21.drop(['year', 'comp_id', 'begin', 'end', 'exit_year','exit_date', 'female', 'birth_year', 'founded_date'], axis = 1)
df_22 = df_22.drop(['year', 'comp_id', 'begin', 'end', 'exit_year','exit_date', 'female', 'birth_year', 'founded_date'], axis = 1)
df_23 = df_23.drop(['exit_year', 'exit_date'], axis = 1)

2.2.1) first dataframe

2.2.1.1) cleaning

For each variable with fewer than 100 missing values, we drop the rows containing those NAs.

Code
for column in df_21.columns.values:
    if (df_21[column].isna().sum() > 0 ) & (df_21[column].isna().sum() < 100):
        df_21 = df_21.dropna(subset=[column])
Code
df_21.isna().sum()
amort                      0
curr_assets                0
curr_liab                  0
extra_exp                  0
extra_inc                  0
extra_profit_loss          0
fixed_assets               0
inc_bef_tax                0
intang_assets              0
inventories                0
liq_assets                 0
material_exp               0
personnel_exp              0
profit_loss_year           0
sales                      0
share_eq                   0
subscribed_cap             0
tang_assets                0
balsheet_flag              0
balsheet_length            0
balsheet_notfullyear       0
founded_year            1989
ceo_count               1987
foreign                 1987
inoffice_days           1987
gender                  1987
origin                  1987
nace_main                  0
ind2                       0
ind                      649
urban_m                    0
region_m                   0
labor_avg               3130
default                    0
dtype: int64
Code
df_21.shape
(21610, 34)

We examine the correlation of ind with the other input variables.

Code
print(df_21.corrwith(df_21['ind']))
amort                  -0.117394
curr_assets            -0.223795
curr_liab              -0.122967
extra_exp               0.002472
extra_inc               0.000359
extra_profit_loss      -0.007845
fixed_assets           -0.027651
inc_bef_tax            -0.092485
intang_assets          -0.068352
inventories            -0.161719
liq_assets             -0.135781
material_exp           -0.202872
personnel_exp          -0.187134
profit_loss_year       -0.050224
sales                  -0.220055
share_eq               -0.121802
subscribed_cap         -0.019166
tang_assets            -0.025620
balsheet_flag          -0.045661
balsheet_length        -0.047909
balsheet_notfullyear    0.065159
founded_year            0.230713
ceo_count              -0.028645
foreign                -0.046350
inoffice_days          -0.196528
nace_main               0.953623
ind2                    0.953080
ind                     1.000000
urban_m                 0.020534
labor_avg              -0.157201
default                 0.147214
dtype: float64

ind is very highly correlated with ind2 and nace_main (ρ ≈ 0.95), so we drop it.

Code
df_21 = df_21.drop(['ind'], axis = 1)
Code
df_21.isna().sum()
amort                      0
curr_assets                0
curr_liab                  0
extra_exp                  0
extra_inc                  0
extra_profit_loss          0
fixed_assets               0
inc_bef_tax                0
intang_assets              0
inventories                0
liq_assets                 0
material_exp               0
personnel_exp              0
profit_loss_year           0
sales                      0
share_eq                   0
subscribed_cap             0
tang_assets                0
balsheet_flag              0
balsheet_length            0
balsheet_notfullyear       0
founded_year            1989
ceo_count               1987
foreign                 1987
inoffice_days           1987
gender                  1987
origin                  1987
nace_main                  0
ind2                       0
urban_m                    0
region_m                   0
labor_avg               3130
default                    0
dtype: int64
Code
len(df_21[df_21.isna().sum(axis=1) >= 5 ])
1987

There are 1,987 rows with at least 5 NAs, so we drop these rows.

Code
# thresh counts non-missing values: keep only rows with at least
# len(df_21.columns) - 5 non-NA entries
df_21 = df_21.dropna(subset=df_21.columns, thresh=len(df_21.columns) - 5)
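Note that `dropna`'s `thresh` argument counts non-missing values, not NAs: a row survives only if it has at least `thresh` non-NA entries. A minimal sketch on a toy frame (the columns and values below are illustrative, not from the dataset):

```python
import numpy as np
import pandas as pd

# toy frame with 6 columns; the rows carry 0, 2, and 4 missing values respectively
df = pd.DataFrame({
    'a': [1.0, np.nan, np.nan],
    'b': [1.0, np.nan, np.nan],
    'c': [1.0, 1.0, np.nan],
    'd': [1.0, 1.0, np.nan],
    'e': [1.0, 1.0, 1.0],
    'f': [1.0, 1.0, 1.0],
})

# keep rows with at least 3 non-NA entries, i.e. at most 3 NAs
kept = df.dropna(thresh=len(df.columns) - 3)
```

Here the third row (4 NAs) is dropped while the first two survive.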
Code
df_21.isna().sum()
amort                      0
curr_assets                0
curr_liab                  0
extra_exp                  0
extra_inc                  0
extra_profit_loss          0
fixed_assets               0
inc_bef_tax                0
intang_assets              0
inventories                0
liq_assets                 0
material_exp               0
personnel_exp              0
profit_loss_year           0
sales                      0
share_eq                   0
subscribed_cap             0
tang_assets                0
balsheet_flag              0
balsheet_length            0
balsheet_notfullyear       0
founded_year               2
ceo_count                  0
foreign                    0
inoffice_days              0
gender                     0
origin                     0
nace_main                  0
ind2                       0
urban_m                    0
region_m                   0
labor_avg               2705
default                    0
dtype: int64

We drop the two rows with NA in founded_year.

Code
df_21 = df_21.dropna(subset=['founded_year'])

We encode the object columns as floats with a label encoder.

Code
from sklearn.preprocessing import LabelEncoder  # not loaded in the preamble

# collect the object-typed columns
object_column = []
for column in df_21.columns.values:
    if df_21[column].dtype == 'object':
        object_column.append(column)

# encode each one as a float and keep the fitted encoders
label_encoders = {}
for column in object_column:
    LE = LabelEncoder()
    df_21[column] = LE.fit_transform(df_21[column]).astype(float)
    label_encoders[column] = LE
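A quick sketch of what LabelEncoder does (toy labels, not from the dataset): classes are sorted alphabetically and mapped to consecutive integers, and the fitted encoder can reverse the mapping:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(['male', 'female', 'male'])  # classes_ -> ['female', 'male']
decoded = le.inverse_transform(codes)                 # back to the original labels
```

Storing the fitted encoders in a dictionary, as above, is what makes this inversion possible later.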

2.2.1.2) labor_avg

Next we deal with the missing values in labor_avg, which we impute with a regression model.

Code
# dataframe 1 without NA
df_21_without_na = df_21.dropna(subset=['labor_avg'])
# dataframe 1 with NA
df_21_with_na = df_21[pd.isna(df_21['labor_avg'])]
Code
# define our input and output variables for the model 
X = df_21_without_na.drop(['labor_avg'], axis = 1)
y = df_21_without_na['labor_avg']
Code
# keep only the columns whose absolute correlation with labor_avg exceeds 0.3
column_cor_with_labor = []
for column in X.columns.values:
    if abs(y.corr(X[column])) > 0.3:
        column_cor_with_labor.append(column)
Code
X = X[column_cor_with_labor]
Code
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42) 
linear regression
Code
# define the model
model_lr = LinearRegression()

# fit the model
model_lr.fit(X_train,y_train)

# prediction
y_pred = model_lr.predict(X_test)

# MSE :
mse_lr  = mean_squared_error(y_test,y_pred)

print('MSE Linear Regression =', mse_lr)
MSE Linear Regression = 0.6833296016478182
ridge regression
Code
# generate an array of alphas
alphas = 10**np.linspace(10,-2,100)*0.5

from sklearn.linear_model import RidgeCV  # not loaded in the preamble
ridgecv = RidgeCV(alphas = alphas, scoring = 'neg_mean_squared_error')

#fit the model
ridgecv.fit(X_train, y_train)

ridge = Ridge(alpha = ridgecv.alpha_)

ridge.fit(X_train, y_train)
print('MSE Ridge : = ',mean_squared_error(y_test, ridge.predict(X_test)))
MSE Ridge : =  0.683303463414254
lasso regression
Code
lasso = Lasso(max_iter = 10000)
lassocv = LassoCV(alphas = None, cv = 10, max_iter = 100000)
lassocv.fit(X_train, y_train)
lasso.set_params(alpha=lassocv.alpha_)
lasso.fit(X_train, y_train)
print('MSE Lasso :', mean_squared_error(y_test, lasso.predict(X_test)))
MSE Lasso : 0.6626442785545502
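The lasso's edge here comes from its ability to zero out weakly informative predictors while cross-validation picks the penalty strength. A minimal sketch of the same LassoCV pattern on synthetic data (the make_regression parameters are illustrative choices):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split

# 10 features, only 3 of which carry signal
X, y = make_regression(n_samples=500, n_features=10, n_informative=3,
                       noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# cross-validation selects the regularisation strength alpha automatically
lassocv = LassoCV(cv=5, max_iter=100000).fit(X_tr, y_tr)
r2 = lassocv.score(X_te, y_te)  # out-of-sample fit with the selected alpha
```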

2.2.1.3) dataframe 1 final

We use the lasso model, which achieved the lowest test MSE, to predict the missing labor_avg values.

Code
X_final = df_21_with_na[column_cor_with_labor]
y_final = lasso.predict(X_final)
y_final
array([0.24282958, 0.23381048, 0.23442006, ..., 0.22247635, 0.22824688,
       0.26779084])
Code
# work on an explicit copy to avoid pandas' SettingWithCopyWarning
df_21_with_na = df_21_with_na.copy()
df_21_with_na['labor_avg'] = y_final

# concatenate our 2 dataframes
df_21 = pd.concat([df_21_with_na, df_21_without_na], ignore_index=False) 

# sort by index
df_21 = df_21.sort_index()

# replace negative values by 0
for column in df_21.columns.values:
    if df_21[column].dtype != 'object':
        df_21[column] = df_21[column].apply(lambda x: max(0, x))

2.2.2) Second dataframe

Code
# select our categorical and numerical columns
numeric_column = []
categorical_column = []

for colum in df_22.columns.values:
    if df_22[colum].dtype == 'object':
        categorical_column.append(colum)
    else :
        numeric_column.append(colum)

print(f'Numerical columns : {numeric_column}')
print(f'Categorical columns : {categorical_column}')
Numerical columns : ['amort', 'curr_assets', 'curr_liab', 'extra_exp', 'extra_inc', 'extra_profit_loss', 'fixed_assets', 'inc_bef_tax', 'intang_assets', 'inventories', 'liq_assets', 'material_exp', 'personnel_exp', 'profit_loss_year', 'sales', 'share_eq', 'subscribed_cap', 'tang_assets', 'balsheet_flag', 'balsheet_length', 'balsheet_notfullyear', 'founded_year', 'ceo_count', 'foreign', 'inoffice_days', 'nace_main', 'ind2', 'ind', 'urban_m', 'labor_avg', 'default']
Categorical columns : ['gender', 'origin', 'region_m']

We will use scikit-learn's IterativeImputer, a regression-based method for handling missing values. It imputes each incomplete variable from the other variables and repeats the process until the imputations stabilise. The algorithm works in several steps:

1. Initially impute missing values with a simple strategy such as the mean, median, or mode.
2. Fit a regression model (linear, nonlinear, or any other type depending on the data) that predicts each incomplete variable from the other variables.
3. Re-impute the missing values with the model's predictions.

Steps 2 and 3 are repeated until convergence, determined by a stopping criterion such as a maximum number of iterations or the stabilisation of successive imputations.

Because it predicts missing entries from the other observed variables, this approach is best suited to data that are Missing At Random (MAR). It can impute continuous variables directly (categorical ones must be encoded first), although it is slower than simple imputation methods.
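As a minimal sketch (on toy data of our own, not from the dataset), the imputer can recover a value implied by a linear relationship between two columns:

```python
import numpy as np

# IterativeImputer is still experimental and must be enabled explicitly
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# toy matrix where the second column is exactly twice the first
X = np.array([[1.0, 2.0],
              [3.0, 6.0],
              [4.0, 8.0],
              [np.nan, 10.0],   # the implied value is 5.0
              [7.0, 14.0]])

imputer = IterativeImputer(max_iter=10, random_state=0)
X_filled = imputer.fit_transform(X)  # no missing entries remain
```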

Code
# Impute missing values using iterative imputation
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the experimental API)
from sklearn.impute import IterativeImputer, SimpleImputer  # not loaded in the preamble

iterative_imputer = IterativeImputer(random_state=seed, max_iter=20, initial_strategy='median')
df_22[numeric_column] = iterative_imputer.fit_transform(df_22[numeric_column])

# Create an object to imput with the most frequent value
categorical_imputer = SimpleImputer(strategy='most_frequent')

# Apply imputation in our categorical columns
df_22[categorical_column] = categorical_imputer.fit_transform(df_22[categorical_column])
Code
# transform object columns in float columns

object_column = []
for column in df_22.columns.values:
    if df_22[column].dtype == 'object':
        object_column.append(column)

label_encoders = {}
for column in object_column:
    LE = LabelEncoder()
    df_22[column] = LE.fit_transform(df_22[column]).astype(float)
    label_encoders[column] = LE

2.2.3) Third dataframe

This dataframe keeps more input variables than the other two. As with the second dataframe, we impute its missing values with iterative imputation.

Code
# select our categorical and numerical columns
numeric_column = []
categorical_column = []

for colum in df_23.columns.values:
    if df_23[colum].dtype == 'object':
        categorical_column.append(colum)
    else :
        numeric_column.append(colum)

print(f'Numerical columns : {numeric_column}')
print(f'Categorical columns : {categorical_column}')
Numerical columns : ['year', 'comp_id', 'amort', 'curr_assets', 'curr_liab', 'extra_exp', 'extra_inc', 'extra_profit_loss', 'fixed_assets', 'inc_bef_tax', 'intang_assets', 'inventories', 'liq_assets', 'material_exp', 'personnel_exp', 'profit_loss_year', 'sales', 'share_eq', 'subscribed_cap', 'tang_assets', 'balsheet_flag', 'balsheet_length', 'balsheet_notfullyear', 'founded_year', 'ceo_count', 'foreign', 'female', 'birth_year', 'inoffice_days', 'nace_main', 'ind2', 'ind', 'urban_m', 'labor_avg', 'default']
Categorical columns : ['begin', 'end', 'gender', 'origin', 'region_m', 'founded_date']
Code
# Impute missing values using iterative imputation
iterative_imputer = IterativeImputer(random_state=seed, max_iter=20, initial_strategy='median')
df_23[numeric_column] = iterative_imputer.fit_transform(df_23[numeric_column])

# Create an object to imput with the most frequent value
categorical_imputer = SimpleImputer(strategy='most_frequent')

# Apply imputation in our categorical columns
df_23[categorical_column] = categorical_imputer.fit_transform(df_23[categorical_column])
Code
object_column = []
for column in df_23.columns.values:
    if df_23[column].dtype == 'object':
        object_column.append(column)

label_encoders = {}
for column in object_column:
    LE = LabelEncoder()
    df_23[column] = LE.fit_transform(df_23[column]).astype(float)
    label_encoders[column] = LE

2.3) Classifier

In this part we write a function for each classifier and compare their AUC scores across our different dataframes.
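AUC can be read as the probability that a randomly chosen defaulter receives a higher predicted score than a randomly chosen non-defaulter. A minimal hand check of this equivalence on toy scores (not from the dataset):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])

# compare every (positive, negative) pair; ties count half
pos = y_score[y_true == 1]
neg = y_score[y_true == 0]
manual_auc = np.mean([(p > n) + 0.5 * (p == n) for p in pos for n in neg])

sk_auc = roc_auc_score(y_true, y_score)  # same value as the pairwise estimate
```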

Logistic regression

Code
from sklearn.linear_model import LogisticRegression  # not loaded in the preamble
from sklearn.metrics import roc_auc_score

def LR(df, scale = False, gridsearch = False):
    
    X = df.drop(['default'], axis = 1)
    y = df['default']
    
    if scale:
        
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = seed) 
        sc_X = StandardScaler()
        X_train = pd.DataFrame(sc_X.fit_transform(X_train))
        X_test = pd.DataFrame(sc_X.transform(X_test))
        
    else:
        
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = seed) 
    
    if gridsearch:
        
        param_grid = {'max_iter' : [2000, 3000, 4000], 
                      'solver' : ['lbfgs', 'liblinear'],
                      'random_state' : [seed]}

        # Create a base model
        clf = LogisticRegression()
        
        # Instantiate the grid search model
        grid_search = GridSearchCV(estimator = clf, param_grid = param_grid, 
                                   cv = cv_fold, scoring = 'roc_auc')
        
        
        grid_result_lr = grid_search.fit(X_train, y_train)
        
        best_lr = grid_result_lr.best_estimator_
        # Make predictions on the test data
        y_pred = best_lr.predict_proba(X_test)[:,1]

        auc = roc_auc_score(y_test, y_pred)
        
        print(f'auc logistic regression with best parameters : {auc} with {grid_result_lr.best_params_}')
        
    else :
        
        logistic_regression = LogisticRegression(max_iter = 10000)
    
        # Fit the model on the training data
        logistic_regression.fit(X_train, y_train)
    
        # Make predictions on the test data
        y_pred = logistic_regression.predict_proba(X_test)[:,1]
    
        auc = roc_auc_score(y_test, y_pred)
    
        print(f'auc regression logistic : {auc}')
    
    

Random forest

Code
def RF(df, scale=False, gridsearch = False):
    
    X = df.drop(['default'], axis = 1)
    y = df['default']
    
    if scale:
        
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = seed) 
        sc_X = StandardScaler()
        X_train = pd.DataFrame(sc_X.fit_transform(X_train))
        X_test = pd.DataFrame(sc_X.transform(X_test))
        
    else:
        
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = seed) 
    
    if gridsearch :
        
        param_grid = {    
        'n_estimators': [50, 100],
        'max_depth': [None, 10, 20],  
        'min_samples_split': [2, 5],  
        'min_samples_leaf': [1, 2],  
        'max_features': ['sqrt', 'log2'],  
        'random_state': [seed]  
        }

        # Create a base model
        clf = RandomForestClassifier(random_state=seed) #Initialize with whatever parameters you want to
        
        # Instantiate the grid search model
        grid_search = GridSearchCV(estimator = clf, param_grid = param_grid, 
                                   cv = cv_fold, scoring = 'roc_auc')
        
        
        grid_result_rf = grid_search.fit(X_train, y_train)
        
        best_rf = grid_result_rf.best_estimator_
        # Make predictions on the test data
        y_pred = best_rf.predict_proba(X_test)[:,1]

        auc = roc_auc_score(y_test, y_pred)
        
        print(f'roc random forest with best parameters : {auc} with {grid_result_rf.best_params_}')
        
    else: 
        # Create a Random Forest classifier (n_estimators can be adjusted as needed)
        rf_classifier = RandomForestClassifier(random_state=seed)
    
        # Fit the model on the training data
        rf_classifier.fit(X_train, y_train)
    
        # Make predictions on the test data
        y_pred = rf_classifier.predict_proba(X_test)[:,1]
    
        auc = roc_auc_score(y_test, y_pred)
    
        print(f'auc random forest : {auc}')
    

Neural Networks

Code
def NN(df, scale=False, gridsearch = False):
    
    X = df.drop(['default'], axis = 1)
    y = df['default']
    
    model = keras.Sequential([
    keras.layers.Dense(128, activation='relu', input_shape=(df.shape[1]-1,)),
    keras.layers.Dense(64, activation='relu'),
    keras.layers.Dense(2, activation='softmax')
    ])
    
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    
    if scale:
        
        X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size = 0.3, random_state = seed) 
        X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size = 0.5, random_state = seed) 
        
        
        sc_X = StandardScaler()
        X_train = pd.DataFrame(sc_X.fit_transform(X_train))
        X_test = pd.DataFrame(sc_X.transform(X_test))
        X_val = pd.DataFrame(sc_X.transform(X_val))
        
    else:
        
        X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size = 0.3, random_state = seed) 
        X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size = 0.5, random_state = seed)
        
    if gridsearch :
        
        param_grid = {
            'optimizer': ['adam', 'sgd'],  # You can add other optimizers
            'neurons_layer1': [64, 128],
            'neurons_layer2': [32, 64],
            'batch_size': [5, 10],
            'epochs': [10, 20],
            'validation_data'  : [(X_val, y_val)]
        }
        
        
        # NOTE: GridSearchCV expects a scikit-learn-compatible estimator; a raw
        # keras.Sequential must first be wrapped (e.g. with scikeras' KerasClassifier)
        # for this grid search to run.
        clf = keras.Sequential([
                                keras.layers.Dense(128, activation='relu', input_shape=(df.shape[1]-1,)),
                                keras.layers.Dense(64, activation='relu'),
                                keras.layers.Dense(2, activation='softmax')
                                ])
        
        # Instantiate the grid search model
        grid_search = GridSearchCV(estimator = clf, param_grid = param_grid, 
                                   cv = cv_fold, scoring = 'roc_auc')
        
        
        grid_result_rf = grid_search.fit(X_train, y_train)
        
        best_rf = grid_result_rf.best_estimator_
        # Make predictions on the test data
        
        y_probs = best_rf.predict(X_test)

        # extract the probabilities for the positive class (class 1)
        y_pred = y_probs[:, 1]

        # compute the ROC AUC
        auc = roc_auc_score(y_test, y_pred)

        print(f'auc neural network with best parameters : {auc} with {grid_result_rf.best_params_}')
        
        
    else: 
        
        model.fit(X_train, y_train, epochs=20, batch_size=5, validation_data=(X_val, y_val))
        
        y_probs = model.predict(X_test)

        # extract the probabilities for the positive class (class 1)
        y_pred = y_probs[:, 1]

        # compute the ROC AUC
        auc = roc_auc_score(y_test, y_pred)

        print(f'auc neural network : {auc}')
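The gridsearch branch above hands a raw keras.Sequential to GridSearchCV, which expects a scikit-learn-compatible estimator; the Keras model would first need a wrapper such as scikeras' KerasClassifier. As a lighter alternative sketch, scikit-learn's own MLPClassifier can be grid-searched directly (the synthetic data, layer sizes, and grid below are illustrative choices, not from the notebook):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# grid over the hidden-layer sizes and the L2 penalty
param_grid = {'hidden_layer_sizes': [(64,), (128, 64)], 'alpha': [1e-4, 1e-2]}
grid = GridSearchCV(MLPClassifier(max_iter=500, random_state=0),
                    param_grid, cv=3, scoring='roc_auc').fit(X_tr, y_tr)

auc = roc_auc_score(y_te, grid.best_estimator_.predict_proba(X_te)[:, 1])
```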

Applying the classifiers to our dataframes

Dataframe 1

Logistic Regression
Code
LR(df_21)
auc regression logistic : 0.7403623677116006
Code
LR(df_21, True)
auc regression logistic : 0.7391690002259819
Code
LR(df_21, scale = False, gridsearch = True)
auc logistic regression with best parameters : 0.7404052020245999 with {'max_iter': 2000, 'random_state': 42, 'solver': 'liblinear'}
Code
LR(df_21, scale = True, gridsearch=True)
auc logistic regression with best parameters : 0.7391690002259819 with {'max_iter': 2000, 'random_state': 42, 'solver': 'lbfgs'}
Random Forest
Code
RF(df_21)
auc random forest : 0.8195816973739922
Code
RF(df_21, scale = True)
auc random forest : 0.8195743820489326
Code
RF(df_21, scale = False, gridsearch = True)
roc random forest with best parameters : 0.8211556380356008 with {'max_depth': 10, 'max_features': 'sqrt', 'min_samples_leaf': 2, 'min_samples_split': 5, 'n_estimators': 100, 'random_state': 42}
Code
RF(df_21, scale = True, gridsearch = True)
roc random forest with best parameters : 0.8212896054342818 with {'max_depth': 10, 'max_features': 'sqrt', 'min_samples_leaf': 2, 'min_samples_split': 5, 'n_estimators': 100, 'random_state': 42}
Neural Networks
Code
NN(df_21)
Epoch 1/20
2747/2747 [==============================] - 6s 2ms/step - loss: 2138.2908 - accuracy: 0.7376 - val_loss: 1476.4219 - val_accuracy: 0.7299
Epoch 2/20
2747/2747 [==============================] - 5s 2ms/step - loss: 613.0333 - accuracy: 0.7358 - val_loss: 137.9308 - val_accuracy: 0.7333
Epoch 3/20
2747/2747 [==============================] - 5s 2ms/step - loss: 55.6422 - accuracy: 0.7497 - val_loss: 10.7381 - val_accuracy: 0.7988
Epoch 4/20
2747/2747 [==============================] - 5s 2ms/step - loss: 1.7425 - accuracy: 0.8024 - val_loss: 4.2925 - val_accuracy: 0.7941
Epoch 5/20
2747/2747 [==============================] - 5s 2ms/step - loss: 0.5384 - accuracy: 0.8012 - val_loss: 4.3312 - val_accuracy: 0.7948
Epoch 6/20
2747/2747 [==============================] - 5s 2ms/step - loss: 0.4993 - accuracy: 0.8013 - val_loss: 4.6568 - val_accuracy: 0.7944
Epoch 7/20
2747/2747 [==============================] - 5s 2ms/step - loss: 0.5149 - accuracy: 0.8009 - val_loss: 3.6265 - val_accuracy: 0.7941
Epoch 8/20
2747/2747 [==============================] - 5s 2ms/step - loss: 0.4990 - accuracy: 0.8009 - val_loss: 3.4404 - val_accuracy: 0.7944
Epoch 9/20
2747/2747 [==============================] - 5s 2ms/step - loss: 0.7729 - accuracy: 0.8006 - val_loss: 2.9655 - val_accuracy: 0.7944
Epoch 10/20
2747/2747 [==============================] - 5s 2ms/step - loss: 0.5057 - accuracy: 0.8007 - val_loss: 3.0504 - val_accuracy: 0.7944
Epoch 11/20
2747/2747 [==============================] - 5s 2ms/step - loss: 0.5249 - accuracy: 0.8007 - val_loss: 2.5144 - val_accuracy: 0.7944
Epoch 12/20
2747/2747 [==============================] - 5s 2ms/step - loss: 0.4996 - accuracy: 0.8006 - val_loss: 2.5146 - val_accuracy: 0.7944
Epoch 13/20
2747/2747 [==============================] - 5s 2ms/step - loss: 0.4996 - accuracy: 0.8006 - val_loss: 2.5144 - val_accuracy: 0.7944
Epoch 14/20
2747/2747 [==============================] - 5s 2ms/step - loss: 0.4996 - accuracy: 0.8006 - val_loss: 2.5147 - val_accuracy: 0.7944
Epoch 15/20
2747/2747 [==============================] - 5s 2ms/step - loss: 0.4996 - accuracy: 0.8006 - val_loss: 2.5144 - val_accuracy: 0.7944
Epoch 16/20
2747/2747 [==============================] - 5s 2ms/step - loss: 0.4996 - accuracy: 0.8006 - val_loss: 2.5146 - val_accuracy: 0.7944
Epoch 17/20
2747/2747 [==============================] - 5s 2ms/step - loss: 0.4996 - accuracy: 0.8006 - val_loss: 2.5146 - val_accuracy: 0.7944
Epoch 18/20
2747/2747 [==============================] - 5s 2ms/step - loss: 0.4995 - accuracy: 0.8006 - val_loss: 2.5143 - val_accuracy: 0.7944
Epoch 19/20
2747/2747 [==============================] - 5s 2ms/step - loss: 0.4995 - accuracy: 0.8006 - val_loss: 2.5144 - val_accuracy: 0.7944
Epoch 20/20
2747/2747 [==============================] - 5s 2ms/step - loss: 0.4996 - accuracy: 0.8006 - val_loss: 2.5143 - val_accuracy: 0.7944
92/92 [==============================] - 0s 1ms/step
auc neural network : 0.49918032786885247
Code
NN(df_21, scale = True)
Epoch 1/20
2747/2747 [==============================] - 6s 2ms/step - loss: 0.4624 - accuracy: 0.8001 - val_loss: 0.4479 - val_accuracy: 0.8012
Epoch 2/20
2747/2747 [==============================] - 5s 2ms/step - loss: 0.4418 - accuracy: 0.8037 - val_loss: 0.4391 - val_accuracy: 0.8043
Epoch 3/20
2747/2747 [==============================] - 5s 2ms/step - loss: 0.4339 - accuracy: 0.8026 - val_loss: 0.4369 - val_accuracy: 0.8087
Epoch 4/20
2747/2747 [==============================] - 5s 2ms/step - loss: 0.4287 - accuracy: 0.8028 - val_loss: 0.4340 - val_accuracy: 0.8046
Epoch 5/20
2747/2747 [==============================] - 5s 2ms/step - loss: 0.4242 - accuracy: 0.8062 - val_loss: 0.4367 - val_accuracy: 0.8002
Epoch 6/20
2747/2747 [==============================] - 5s 2ms/step - loss: 0.4185 - accuracy: 0.8116 - val_loss: 0.4358 - val_accuracy: 0.8063
Epoch 7/20
2747/2747 [==============================] - 5s 2ms/step - loss: 0.4208 - accuracy: 0.8092 - val_loss: 0.4265 - val_accuracy: 0.8101
Epoch 8/20
2747/2747 [==============================] - 5s 2ms/step - loss: 0.4136 - accuracy: 0.8089 - val_loss: 0.4313 - val_accuracy: 0.8043
Epoch 9/20
2747/2747 [==============================] - 5s 2ms/step - loss: 0.4110 - accuracy: 0.8112 - val_loss: 0.4290 - val_accuracy: 0.8077
Epoch 10/20
2747/2747 [==============================] - 5s 2ms/step - loss: 0.4083 - accuracy: 0.8145 - val_loss: 0.4233 - val_accuracy: 0.8070
Epoch 11/20
2747/2747 [==============================] - 5s 2ms/step - loss: 0.4049 - accuracy: 0.8148 - val_loss: 0.4332 - val_accuracy: 0.8036
Epoch 12/20
2747/2747 [==============================] - 5s 2ms/step - loss: 0.4055 - accuracy: 0.8178 - val_loss: 0.4258 - val_accuracy: 0.8033
Epoch 13/20
2747/2747 [==============================] - 5s 2ms/step - loss: 0.3987 - accuracy: 0.8176 - val_loss: 0.4196 - val_accuracy: 0.8067
Epoch 14/20
2747/2747 [==============================] - 5s 2ms/step - loss: 0.3975 - accuracy: 0.8180 - val_loss: 0.4176 - val_accuracy: 0.8097
Epoch 15/20
2747/2747 [==============================] - 5s 2ms/step - loss: 0.3943 - accuracy: 0.8181 - val_loss: 0.4207 - val_accuracy: 0.8073
Epoch 16/20
2747/2747 [==============================] - 5s 2ms/step - loss: 0.3929 - accuracy: 0.8205 - val_loss: 0.4228 - val_accuracy: 0.8090
Epoch 17/20
2747/2747 [==============================] - 5s 2ms/step - loss: 0.3915 - accuracy: 0.8199 - val_loss: 0.4181 - val_accuracy: 0.8118
Epoch 18/20
2747/2747 [==============================] - 5s 2ms/step - loss: 0.3878 - accuracy: 0.8194 - val_loss: 0.4392 - val_accuracy: 0.8053
Epoch 19/20
2747/2747 [==============================] - 5s 2ms/step - loss: 0.3876 - accuracy: 0.8234 - val_loss: 0.4263 - val_accuracy: 0.8067
Epoch 20/20
2747/2747 [==============================] - 5s 2ms/step - loss: 0.3835 - accuracy: 0.8230 - val_loss: 0.4389 - val_accuracy: 0.8019
92/92 [==============================] - 0s 1ms/step
auc neural network : 0.7834776714849622
Code
NN(df_21, scale = False, gridsearch = True)
roc random forest with best parameters : 0.8158561254161574 with {'max_depth': 10, 'max_features': 'sqrt', 'min_samples_leaf': 2, 'min_samples_split': 5, 'n_estimators': 100, 'random_state': 42}
Code
NN(df_21, scale = True, gridsearch = True)
roc random forest with best parameters : 0.8160619214182365 with {'max_depth': 10, 'max_features': 'sqrt', 'min_samples_leaf': 2, 'min_samples_split': 5, 'n_estimators': 100, 'random_state': 42}

Dataframe 2

Logistic Regression
Code
LR(df_22)
auc regression logistic : 0.7337923953327186
Code
LR(df_22, True)
auc regression logistic : 0.7337213665030331
Code
LR(df_22, scale = False, gridsearch = True)
auc logistic regression with best parameters : 0.733862994552547 with {'max_iter': 2000, 'random_state': 42, 'solver': 'liblinear'}
Code
LR(df_22, scale = True, gridsearch=True)
auc logistic regression with best parameters : 0.7337213665030331 with {'max_iter': 2000, 'random_state': 42, 'solver': 'lbfgs'}
Random Forest
Code
RF(df_22)
auc random forest : 0.8282791404365982
Code
RF(df_22, scale = True)
auc random forest : 0.8285891755500437
Code
RF(df_22, scale = False, gridsearch = True)
roc random forest with best parameters : 0.8318229921467318 with {'max_depth': None, 'max_features': 'sqrt', 'min_samples_leaf': 2, 'min_samples_split': 5, 'n_estimators': 100, 'random_state': 42}
Code
RF(df_22, scale = True, gridsearch = True)
roc random forest with best parameters : 0.8316275196618113 with {'max_depth': None, 'max_features': 'sqrt', 'min_samples_leaf': 2, 'min_samples_split': 5, 'n_estimators': 100, 'random_state': 42}
Neural Networks
Code
NN(df_22)
Epoch 1/20
3042/3042 [==============================] - 6s 2ms/step - loss: 1325.3367 - accuracy: 0.7221 - val_loss: 52.0162 - val_accuracy: 0.7581
Epoch 2/20
3042/3042 [==============================] - 6s 2ms/step - loss: 10.5787 - accuracy: 0.7884 - val_loss: 0.9677 - val_accuracy: 0.7925
Epoch 3/20
3042/3042 [==============================] - 6s 2ms/step - loss: 1.1593 - accuracy: 0.7942 - val_loss: 0.5102 - val_accuracy: 0.7940
Epoch 4/20
3042/3042 [==============================] - 6s 2ms/step - loss: 0.5071 - accuracy: 0.7954 - val_loss: 0.5101 - val_accuracy: 0.7940
Epoch 5/20
3042/3042 [==============================] - 6s 2ms/step - loss: 0.5078 - accuracy: 0.7953 - val_loss: 0.5083 - val_accuracy: 0.7940
Epoch 6/20
3042/3042 [==============================] - 6s 2ms/step - loss: 0.5068 - accuracy: 0.7953 - val_loss: 0.5083 - val_accuracy: 0.7940
Epoch 7/20
3042/3042 [==============================] - 6s 2ms/step - loss: 0.5068 - accuracy: 0.7953 - val_loss: 0.5082 - val_accuracy: 0.7940
Epoch 8/20
3042/3042 [==============================] - 6s 2ms/step - loss: 0.5069 - accuracy: 0.7953 - val_loss: 0.5082 - val_accuracy: 0.7940
Epoch 9/20
3042/3042 [==============================] - 6s 2ms/step - loss: 0.5068 - accuracy: 0.7953 - val_loss: 0.5082 - val_accuracy: 0.7940
Epoch 10/20
3042/3042 [==============================] - 6s 2ms/step - loss: 0.5068 - accuracy: 0.7953 - val_loss: 0.5083 - val_accuracy: 0.7940
Epoch 11/20
3042/3042 [==============================] - 6s 2ms/step - loss: 0.5069 - accuracy: 0.7953 - val_loss: 0.5082 - val_accuracy: 0.7940
Epoch 12/20
3042/3042 [==============================] - 6s 2ms/step - loss: 0.5068 - accuracy: 0.7953 - val_loss: 0.5083 - val_accuracy: 0.7940
Epoch 13/20
3042/3042 [==============================] - 6s 2ms/step - loss: 0.5069 - accuracy: 0.7953 - val_loss: 0.5082 - val_accuracy: 0.7940
Epoch 14/20
3042/3042 [==============================] - 6s 2ms/step - loss: 0.5068 - accuracy: 0.7953 - val_loss: 0.5082 - val_accuracy: 0.7940
Epoch 15/20
3042/3042 [==============================] - 6s 2ms/step - loss: 0.5069 - accuracy: 0.7953 - val_loss: 0.5082 - val_accuracy: 0.7940
Epoch 16/20
3042/3042 [==============================] - 6s 2ms/step - loss: 0.5069 - accuracy: 0.7953 - val_loss: 0.5082 - val_accuracy: 0.7940
Epoch 17/20
3042/3042 [==============================] - 6s 2ms/step - loss: 0.5069 - accuracy: 0.7953 - val_loss: 0.5083 - val_accuracy: 0.7940
Epoch 18/20
3042/3042 [==============================] - 6s 2ms/step - loss: 0.5069 - accuracy: 0.7953 - val_loss: 0.5082 - val_accuracy: 0.7940
Epoch 19/20
3042/3042 [==============================] - 6s 2ms/step - loss: 0.7176 - accuracy: 0.7949 - val_loss: 0.5084 - val_accuracy: 0.7940
Epoch 20/20
3042/3042 [==============================] - 6s 2ms/step - loss: 0.5238 - accuracy: 0.7953 - val_loss: 0.5085 - val_accuracy: 0.7940
102/102 [==============================] - 0s 1ms/step
auc neural network : 0.5007342143906021
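The unscaled run collapses: the first-epoch loss explodes (≈ 1325) before flattening, and the resulting AUC is essentially coin-flip (≈ 0.50). A plausible explanation is that the raw features live on very different scales, which destabilizes gradient descent. The sketch below (synthetic data, not the actual df_22 features) shows the z-score standardization that `StandardScaler` applies and that the `scale = True` runs rely on:

```python
import numpy as np

# Illustrative only: two synthetic features on very different scales,
# mimicking e.g. a small count variable next to a price-like variable.
rng = np.random.default_rng(42)
X = np.column_stack([rng.integers(1, 5, 1000),       # small-scale feature
                     rng.uniform(0, 10_000, 1000)])  # large-scale feature

# Manual z-score standardization (what sklearn's StandardScaler computes):
# each column ends up with mean ~0 and standard deviation ~1, so no single
# feature dominates the network's gradients.
mu, sigma = X.mean(axis=0), X.std(axis=0)
X_scaled = (X - mu) / sigma

print(X_scaled.mean(axis=0))  # ~0 per column
print(X_scaled.std(axis=0))   # ~1 per column
```

With inputs on a common scale, the scaled runs below start from a loss below 0.5 and improve steadily instead of diverging.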
Code
NN(df_22, scale = True)
Epoch 1/20
3042/3042 [==============================] - 6s 2ms/step - loss: 0.4611 - accuracy: 0.8001 - val_loss: 0.4421 - val_accuracy: 0.8014
Epoch 2/20
3042/3042 [==============================] - 6s 2ms/step - loss: 0.4348 - accuracy: 0.8042 - val_loss: 0.4379 - val_accuracy: 0.7971
Epoch 3/20
3042/3042 [==============================] - 6s 2ms/step - loss: 0.4230 - accuracy: 0.8091 - val_loss: 0.4361 - val_accuracy: 0.8076
Epoch 4/20
3042/3042 [==============================] - 6s 2ms/step - loss: 0.4120 - accuracy: 0.8170 - val_loss: 0.4243 - val_accuracy: 0.8122
Epoch 5/20
3042/3042 [==============================] - 6s 2ms/step - loss: 0.3998 - accuracy: 0.8224 - val_loss: 0.4150 - val_accuracy: 0.8115
Epoch 6/20
3042/3042 [==============================] - 6s 2ms/step - loss: 0.3924 - accuracy: 0.8276 - val_loss: 0.4065 - val_accuracy: 0.8186
Epoch 7/20
3042/3042 [==============================] - 6s 2ms/step - loss: 0.3826 - accuracy: 0.8313 - val_loss: 0.4034 - val_accuracy: 0.8171
Epoch 8/20
3042/3042 [==============================] - 6s 2ms/step - loss: 0.3780 - accuracy: 0.8324 - val_loss: 0.4063 - val_accuracy: 0.8189
Epoch 9/20
3042/3042 [==============================] - 6s 2ms/step - loss: 0.3705 - accuracy: 0.8336 - val_loss: 0.4165 - val_accuracy: 0.8158
Epoch 10/20
3042/3042 [==============================] - 6s 2ms/step - loss: 0.3694 - accuracy: 0.8330 - val_loss: 0.4071 - val_accuracy: 0.8146
Epoch 11/20
3042/3042 [==============================] - 6s 2ms/step - loss: 0.3635 - accuracy: 0.8359 - val_loss: 0.4063 - val_accuracy: 0.8165
Epoch 12/20
3042/3042 [==============================] - 6s 2ms/step - loss: 0.3635 - accuracy: 0.8366 - val_loss: 0.4162 - val_accuracy: 0.8106
Epoch 13/20
3042/3042 [==============================] - 6s 2ms/step - loss: 0.3584 - accuracy: 0.8387 - val_loss: 0.4427 - val_accuracy: 0.8140
Epoch 14/20
3042/3042 [==============================] - 6s 2ms/step - loss: 0.3569 - accuracy: 0.8408 - val_loss: 0.4099 - val_accuracy: 0.8192
Epoch 15/20
3042/3042 [==============================] - 6s 2ms/step - loss: 0.3534 - accuracy: 0.8410 - val_loss: 0.4209 - val_accuracy: 0.8235
Epoch 16/20
3042/3042 [==============================] - 6s 2ms/step - loss: 0.3481 - accuracy: 0.8413 - val_loss: 0.4127 - val_accuracy: 0.8235
Epoch 17/20
3042/3042 [==============================] - 6s 2ms/step - loss: 0.3462 - accuracy: 0.8408 - val_loss: 0.4150 - val_accuracy: 0.8183
Epoch 18/20
3042/3042 [==============================] - 6s 2ms/step - loss: 0.3443 - accuracy: 0.8423 - val_loss: 0.4423 - val_accuracy: 0.8152
Epoch 19/20
3042/3042 [==============================] - 6s 2ms/step - loss: 0.3415 - accuracy: 0.8439 - val_loss: 0.4385 - val_accuracy: 0.8263
Epoch 20/20
3042/3042 [==============================] - 6s 2ms/step - loss: 0.3438 - accuracy: 0.8442 - val_loss: 0.4168 - val_accuracy: 0.8290
102/102 [==============================] - 0s 1ms/step
auc neural network : 0.8065809304757641
Code
NN(df_22, scale = False, gridsearch = True)
roc random forest with best parameters : 0.8284646773956521 with {'max_depth': None, 'max_features': 'sqrt', 'min_samples_leaf': 2, 'min_samples_split': 5, 'n_estimators': 100, 'random_state': 42}
Code
NN(df_22, scale = True, gridsearch = True)
roc random forest with best parameters : 0.8284211029962099 with {'max_depth': None, 'max_features': 'sqrt', 'min_samples_leaf': 2, 'min_samples_split': 5, 'n_estimators': 100, 'random_state': 42}

Dataframe 3

Logistic Regression
Code
LR(df_23)
auc regression logistic : 0.5044410202947697
Code
LR(df_23, True)
auc regression logistic : 0.739775428607434
Code
LR(df_23, scale = False, gridsearch = True)
auc logistic regression with best parameters : 0.5044410202947697 with {'max_iter': 2000, 'random_state': 42, 'solver': 'lbfgs'}
Code
LR(df_23, scale = True, gridsearch=True)
auc logistic regression with best parameters : 0.739775428607434 with {'max_iter': 2000, 'random_state': 42, 'solver': 'lbfgs'}
Random Forest
Code
RF(df_23)
auc random forest : 0.8321952490877951
Code
RF(df_23, scale = True)
auc random forest : 0.8318928037484892
Code
RF(df_23, scale = False, gridsearch = True)
roc random forest with best parameters : 0.8350875974498359 with {'max_depth': 20, 'max_features': 'log2', 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 100, 'random_state': 42}
Code
RF(df_23, scale = True, gridsearch = True)
roc random forest with best parameters : 0.8350704130555572 with {'max_depth': 20, 'max_features': 'log2', 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 100, 'random_state': 42}
Neural Networks
Code
NN(df_23)
Epoch 1/20
3042/3042 [==============================] - 6s 2ms/step - loss: 465218112.0000 - accuracy: 0.6831 - val_loss: 74972016.0000 - val_accuracy: 0.7940
Epoch 2/20
3042/3042 [==============================] - 6s 2ms/step - loss: 92365392.0000 - accuracy: 0.6812 - val_loss: 19476078.0000 - val_accuracy: 0.7940
Epoch 3/20
3042/3042 [==============================] - 6s 2ms/step - loss: 18728034.0000 - accuracy: 0.6762 - val_loss: 258.5468 - val_accuracy: 0.7940
Epoch 4/20
3042/3042 [==============================] - 6s 2ms/step - loss: 115.1466 - accuracy: 0.7952 - val_loss: 203.4580 - val_accuracy: 0.7940
Epoch 5/20
3042/3042 [==============================] - 6s 2ms/step - loss: 90.6989 - accuracy: 0.7951 - val_loss: 134.5698 - val_accuracy: 0.7940
Epoch 6/20
3042/3042 [==============================] - 6s 2ms/step - loss: 65.6811 - accuracy: 0.7940 - val_loss: 124.6895 - val_accuracy: 0.7940
Epoch 7/20
3042/3042 [==============================] - 6s 2ms/step - loss: 54.2912 - accuracy: 0.7953 - val_loss: 91.1641 - val_accuracy: 0.7916
Epoch 8/20
3042/3042 [==============================] - 6s 2ms/step - loss: 6551257.0000 - accuracy: 0.7859 - val_loss: 40.1850 - val_accuracy: 0.7940
Epoch 9/20
3042/3042 [==============================] - 6s 2ms/step - loss: 47.8869 - accuracy: 0.7951 - val_loss: 17.2153 - val_accuracy: 0.7940
Epoch 10/20
3042/3042 [==============================] - 6s 2ms/step - loss: 1221.1851 - accuracy: 0.7952 - val_loss: 15.1492 - val_accuracy: 0.7940
Epoch 11/20
3042/3042 [==============================] - 6s 2ms/step - loss: 9.8182 - accuracy: 0.7952 - val_loss: 12.2419 - val_accuracy: 0.7940
Epoch 12/20
3042/3042 [==============================] - 6s 2ms/step - loss: 8.2885 - accuracy: 0.7952 - val_loss: 6.9110 - val_accuracy: 0.7940
Epoch 13/20
3042/3042 [==============================] - 6s 2ms/step - loss: 6.4688 - accuracy: 0.7952 - val_loss: 0.5070 - val_accuracy: 0.7940
Epoch 14/20
3042/3042 [==============================] - 6s 2ms/step - loss: 2.9817 - accuracy: 0.7952 - val_loss: 0.5077 - val_accuracy: 0.7940
Epoch 15/20
3042/3042 [==============================] - 6s 2ms/step - loss: 0.7659 - accuracy: 0.7953 - val_loss: 5.3794 - val_accuracy: 0.7931
Epoch 16/20
3042/3042 [==============================] - 6s 2ms/step - loss: 11075.4990 - accuracy: 0.7941 - val_loss: 0.5839 - val_accuracy: 0.7937
Epoch 17/20
3042/3042 [==============================] - 6s 2ms/step - loss: 0.5070 - accuracy: 0.7951 - val_loss: 0.5839 - val_accuracy: 0.7937
Epoch 18/20
3042/3042 [==============================] - 6s 2ms/step - loss: 0.5069 - accuracy: 0.7951 - val_loss: 0.5840 - val_accuracy: 0.7937
Epoch 19/20
3042/3042 [==============================] - 6s 2ms/step - loss: 0.5070 - accuracy: 0.7951 - val_loss: 0.5839 - val_accuracy: 0.7937
Epoch 20/20
3042/3042 [==============================] - 6s 2ms/step - loss: 0.5070 - accuracy: 0.7951 - val_loss: 0.5839 - val_accuracy: 0.7937
102/102 [==============================] - 0s 1ms/step
auc neural network : 0.5011636927851048
Code
NN(df_23, scale = True)
Epoch 1/20
3042/3042 [==============================] - 6s 2ms/step - loss: 0.4616 - accuracy: 0.8005 - val_loss: 0.4461 - val_accuracy: 0.8048
Epoch 2/20
3042/3042 [==============================] - 6s 2ms/step - loss: 0.4359 - accuracy: 0.8055 - val_loss: 0.4480 - val_accuracy: 0.7974
Epoch 3/20
3042/3042 [==============================] - 6s 2ms/step - loss: 0.4275 - accuracy: 0.8076 - val_loss: 0.4441 - val_accuracy: 0.8048
Epoch 4/20
3042/3042 [==============================] - 6s 2ms/step - loss: 0.4188 - accuracy: 0.8095 - val_loss: 0.4457 - val_accuracy: 0.8072
Epoch 5/20
3042/3042 [==============================] - 6s 2ms/step - loss: 0.4126 - accuracy: 0.8128 - val_loss: 0.4326 - val_accuracy: 0.8020
Epoch 6/20
3042/3042 [==============================] - 6s 2ms/step - loss: 0.4053 - accuracy: 0.8161 - val_loss: 0.4194 - val_accuracy: 0.8143
Epoch 7/20
3042/3042 [==============================] - 6s 2ms/step - loss: 0.3962 - accuracy: 0.8213 - val_loss: 0.4172 - val_accuracy: 0.8149
Epoch 8/20
3042/3042 [==============================] - 6s 2ms/step - loss: 0.3861 - accuracy: 0.8253 - val_loss: 0.4152 - val_accuracy: 0.8082
Epoch 9/20
3042/3042 [==============================] - 6s 2ms/step - loss: 0.3741 - accuracy: 0.8296 - val_loss: 0.4044 - val_accuracy: 0.8214
Epoch 10/20
3042/3042 [==============================] - 6s 2ms/step - loss: 0.3657 - accuracy: 0.8345 - val_loss: 0.4087 - val_accuracy: 0.8152
Epoch 11/20
3042/3042 [==============================] - 6s 2ms/step - loss: 0.3602 - accuracy: 0.8349 - val_loss: 0.3991 - val_accuracy: 0.8214
Epoch 12/20
3042/3042 [==============================] - 6s 2ms/step - loss: 0.3572 - accuracy: 0.8374 - val_loss: 0.4009 - val_accuracy: 0.8247
Epoch 13/20
3042/3042 [==============================] - 6s 2ms/step - loss: 0.3534 - accuracy: 0.8366 - val_loss: 0.4063 - val_accuracy: 0.8192
Epoch 14/20
3042/3042 [==============================] - 6s 2ms/step - loss: 0.3474 - accuracy: 0.8397 - val_loss: 0.3986 - val_accuracy: 0.8214
Epoch 15/20
3042/3042 [==============================] - 6s 2ms/step - loss: 0.3542 - accuracy: 0.8423 - val_loss: 0.4284 - val_accuracy: 0.8198
Epoch 16/20
3042/3042 [==============================] - 6s 2ms/step - loss: 0.3404 - accuracy: 0.8433 - val_loss: 0.4120 - val_accuracy: 0.8226
Epoch 17/20
3042/3042 [==============================] - 6s 2ms/step - loss: 0.3392 - accuracy: 0.8429 - val_loss: 0.4108 - val_accuracy: 0.8211
Epoch 18/20
3042/3042 [==============================] - 6s 2ms/step - loss: 0.3356 - accuracy: 0.8468 - val_loss: 0.4153 - val_accuracy: 0.8214
Epoch 19/20
3042/3042 [==============================] - 6s 2ms/step - loss: 0.3344 - accuracy: 0.8455 - val_loss: 0.4164 - val_accuracy: 0.8174
Epoch 20/20
3042/3042 [==============================] - 6s 2ms/step - loss: 0.3305 - accuracy: 0.8463 - val_loss: 0.4299 - val_accuracy: 0.8189
102/102 [==============================] - 0s 1ms/step
auc neural network : 0.8053204056918988
Code
NN(df_23, scale = False, gridsearch = True)
roc random forest with best parameters : 0.8326355733422646 with {'max_depth': 20, 'max_features': 'log2', 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 100, 'random_state': 42}
Code
NN(df_23, scale = True, gridsearch = True)
roc random forest with best parameters : 0.8325210837437301 with {'max_depth': 20, 'max_features': 'log2', 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 100, 'random_state': 42}

Comparative table

Code
arrays = [np.array(["Logistic Regression", "Logistic Regression","Logistic Regression","Logistic Regression",
                    "Random Forest", "Random Forest", "Random Forest", "Random Forest",
                    "Neural Networks", "Neural Networks", "Neural Networks", "Neural Networks"]), 
          np.array(["No scale, no gridsearch", "Scale", "Gridsearch", "Scale + gridsearch",
                    "No scale, no gridsearch", "Scale", "Gridsearch", "Scale + gridsearch",
                    "No scale, no gridsearch", "Scale", "Gridsearch", "Scale + gridsearch",]),]

df_to_compare_model = pd.DataFrame(np.array([
                                             [0.74036,0.73379,0.50444],
                                             [0.73917,0.73372,0.73978],
                                             [0.74041,0.73386,0.50444],
                                             [0.73917,0.73372,0.73978],
                                             [0.81958,0.82828,0.83220],
                                             [0.81957,0.82859,0.83189],
                                             [0.82116,0.83182,0.83509],
                                             [0.82129,0.83163,0.83507],
                                             [0.49918,0.50073,0.50116],
                                             [0.78348,0.80658,0.80532],
                                             [0.81586,0.82846,0.83264],
                                             [0.81606,0.82842,0.83252],
                                             ]), 
                                    index = arrays, columns=['Dataframe 1', 'Dataframe 2', 'Dataframe 3'])

df_to_compare_model
Dataframe 1 Dataframe 2 Dataframe 3
Logistic Regression No scale, no gridsearch 0.74036 0.73379 0.50444
Scale 0.73917 0.73372 0.73978
Gridsearch 0.74041 0.73386 0.50444
Scale + gridsearch 0.73917 0.73372 0.73978
Random Forest No scale, no gridsearch 0.81958 0.82828 0.83220
Scale 0.81957 0.82859 0.83189
Gridsearch 0.82116 0.83182 0.83509
Scale + gridsearch 0.82129 0.83163 0.83507
Neural Networks No scale, no gridsearch 0.49918 0.50073 0.50116
Scale 0.78348 0.80658 0.80532
Gridsearch 0.81586 0.82846 0.83264
Scale + gridsearch 0.81606 0.82842 0.83252
Code
max_value = df_to_compare_model.values.max()  # Get the max value
max_indices = np.where(df_to_compare_model.values == max_value)  

max_rows, max_columns = max_indices

print(f"The maximum value is {max_value} with:")
for row, column in zip(max_rows, max_columns):
    print(df_to_compare_model.index[row], df_to_compare_model.columns[column])
The maximum value is 0.83509 with:
('Random Forest', 'Gridsearch') Dataframe 3
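The search above reports only the single global maximum. A natural complement (a sketch, rebuilding the same comparison table rather than reusing the notebook's `df_to_compare_model` in place) is `idxmax`, which returns the best (model, setup) pair for each dataframe at once:

```python
import numpy as np
import pandas as pd

# Rebuild the comparative table from the AUC values reported above.
arrays = [np.array(["Logistic Regression"] * 4 + ["Random Forest"] * 4
                   + ["Neural Networks"] * 4),
          np.array(["No scale, no gridsearch", "Scale", "Gridsearch",
                    "Scale + gridsearch"] * 3)]
values = np.array([[0.74036, 0.73379, 0.50444],
                   [0.73917, 0.73372, 0.73978],
                   [0.74041, 0.73386, 0.50444],
                   [0.73917, 0.73372, 0.73978],
                   [0.81958, 0.82828, 0.83220],
                   [0.81957, 0.82859, 0.83189],
                   [0.82116, 0.83182, 0.83509],
                   [0.82129, 0.83163, 0.83507],
                   [0.49918, 0.50073, 0.50116],
                   [0.78348, 0.80658, 0.80532],
                   [0.81586, 0.82846, 0.83264],
                   [0.81606, 0.82842, 0.83252]])
df = pd.DataFrame(values, index=arrays,
                  columns=["Dataframe 1", "Dataframe 2", "Dataframe 3"])

# idxmax returns, per column, the MultiIndex label of the best row.
best = df.idxmax()
print(best["Dataframe 3"])  # ('Random Forest', 'Gridsearch')
```

This confirms the result above and also shows that the random forest wins on every dataframe, with the gridsearched configurations edging out the defaults.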